CSCI - 4146 - The Process of Data Science - Summer 2022

</center>

Assignment 3

</center>

Zesheng Jia
\<B00845993>

Initialization

Download dataset and files

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Books_5.json.gz

Import some packages

Hidden Cell: You may unfold this section for code details

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
# import all related libraries
import pandas as pd
import warnings
import seaborn as sns
from datetime import datetime
from datetime import timedelta
import joblib
import datetime
import math
from sklearn.base import BaseEstimator, TransformerMixin

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# For data preprocess
import numpy as np
import csv
import os

# For plotting
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# get basemap for geographical plot
# from mpl_toolkits.basemap import Basemap

myseed = 42069  # set a random seed for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(myseed)
torch.manual_seed(myseed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(myseed)

# Numerical Operations
import math
import numpy as np

# Reading/Writing Data
import pandas as pd
import os
import csv

# For Progress Bar
from tqdm import tqdm

# Pytorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split

# For plotting learning curve
from torch.utils.tensorboard import SummaryWriter

# for matplotlib ploting
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import numpy
from sklearn import metrics

from sklearn.preprocessing import normalize
from sklearn.metrics import confusion_matrix

import nltk
# download the basic list of data and models
nltk.download('popular')

# download "book" collection of datasets from NLTK website
nltk.download("book")

from nltk.book import *
from nltk.corpus import stopwords    
stop_words = set(stopwords.words('english'))
[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/shakespeare.zip.
[nltk_data]    | Downloading package stopwords to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/stopwords.zip.
[nltk_data]    | Downloading package treebank to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/treebank.zip.
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/twitter_samples.zip.
[nltk_data]    | Downloading package omw to /root/nltk_data...
[nltk_data]    | Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]    | Downloading package wordnet to /root/nltk_data...
[nltk_data]    | Downloading package wordnet2021 to /root/nltk_data...
[nltk_data]    | Downloading package wordnet31 to /root/nltk_data...
[nltk_data]    | Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/wordnet_ic.zip.
[nltk_data]    | Downloading package words to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/words.zip.
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data]    | Downloading package punkt to /root/nltk_data...
[nltk_data]    |   Unzipping tokenizers/punkt.zip.
[nltk_data]    | Downloading package snowball_data to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | 
[nltk_data]  Done downloading collection popular
[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/dependency_treebank.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package ieer to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/ieer.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package nps_chat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/nps_chat.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Downloading package ppattach to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/ppattach.zip.
[nltk_data]    | Downloading package reuters to /root/nltk_data...
[nltk_data]    | Downloading package senseval to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/senseval.zip.
[nltk_data]    | Downloading package state_union to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/state_union.zip.
[nltk_data]    | Downloading package stopwords to /root/nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package swadesh to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/swadesh.zip.
[nltk_data]    | Downloading package timit to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/timit.zip.
[nltk_data]    | Downloading package treebank to /root/nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package toolbox to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/toolbox.zip.
[nltk_data]    | Downloading package udhr to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/udhr.zip.
[nltk_data]    | Downloading package udhr2 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/udhr2.zip.
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/unicode_samples.zip.
[nltk_data]    | Downloading package webtext to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/webtext.zip.
[nltk_data]    | Downloading package wordnet to /root/nltk_data...
[nltk_data]    |   Package wordnet is already up-to-date!
[nltk_data]    | Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]    |   Package wordnet_ic is already up-to-date!
[nltk_data]    | Downloading package words to /root/nltk_data...
[nltk_data]    |   Package words is already up-to-date!
[nltk_data]    | Downloading package maxent_treebank_pos_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/maxent_treebank_pos_tagger.zip.
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package maxent_ne_chunker is already up-to-date!
[nltk_data]    | Downloading package universal_tagset to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/universal_tagset.zip.
[nltk_data]    | Downloading package punkt to /root/nltk_data...
[nltk_data]    |   Package punkt is already up-to-date!
[nltk_data]    | Downloading package book_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/book_grammars.zip.
[nltk_data]    | Downloading package city_database to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/city_database.zip.
[nltk_data]    | Downloading package tagsets to /root/nltk_data...
[nltk_data]    |   Unzipping help/tagsets.zip.
[nltk_data]    | Downloading package panlex_swadesh to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | 
[nltk_data]  Done downloading collection book
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Jupyer notebook setting

Hidden Cell: You may unfold this section for code details

In [ ]:
sns.set()
# Set up with a higher resolution screen (useful on Mac)
%config InlineBackend.figure_format = 'retina'

###############################################
def max_print_out(pattern=False):
    '''It will maximize print out line and set float format with .2f'''
    number = None if pattern else 10
    # Set options to avoid truncation when displaying a dataframe
    pd.set_option("display.max_rows", number)
    pd.set_option("display.max_columns", 50)
    # Set floating point numbers to be displayed with 2 decimal places
    pd.set_option('display.float_format', '{:.2f}'.format)
    # for showing all entities 
    

Utility Function

Hidden Cell: You may unfold this section for code details

In [ ]:
################################  NEW FUNCTION IN AS3 ################################
#--------------remove_stop_words-------------
def remove_stop_words(data, stop_words):
  feature = data.select_dtypes(exclude="number").columns
  for i in range(len(feature)):
      print("Now it's removing stop words from ", feature[i])
      # remove stop words
      # first change all character to lower case
      data[feature[i]] = data[feature[i]].str.lower()
      data[feature[i]] = data[feature[i]].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
  return data 

#-----------get_model_set-------------
def get_model_set(data):
  # get our data set into features and labels
  X = data.iloc[:,:-1]
  y = data.iloc[:,-1:].values.ravel()
  y = y.astype(int)
  return X, y

#---------words_importance_plot---------------------
def words_importance_barplot(results, fig_size = (8,8)):
  fig, ax = plt.subplots(figsize = fig_size)
  results.boxplot(ax=ax)
  ax.set_ylabel('Importance')
  ax.set_title("Barplot of words' importance")

  
#--------------remove_num_non_letters-------------
def remove_num_non_letters(data):
  feature = data.select_dtypes(exclude="number").columns
  for i in range(len(feature)):
      print("Now it's removing num_non_letters from ", feature[i])
      # remove stop words
      # first change all character to lower case
      data[feature[i]] = data[feature[i]].str.replace('[^\w\s]+', '')
      data[feature[i]] = data[feature[i]].str.replace('[0-9]+', '')
  return data     



# def a function to draw the Bar plot
#----------------plot_frequenct_words_bar-------------------
def plot_frequenct_words_bar(data, figsize = (15,10), name = 'style'):
  fig, ax = plt.subplots(figsize = figsize)
  data.plot.barh(ax = ax)
  ax.set_title("Most frequent 50 words' distribution of " + str(name))
  ax.set_ylabel('Counts')

# def a function to get the report
#------------------frequent_words_reports---------------------
def frequent_words_reports(data, feature = 'style'):
  max_print_out(True)
  frequent_words = data[feature].str.split(expand=True).stack().value_counts().head(50) # get value accounts for words
  frequent_words = pd.DataFrame(frequent_words)
  plot_frequenct_words_bar(frequent_words)
  return frequent_words



#--------------- get_words_reports-------------
from tqdm.notebook import tqdm
def get_words_dictionary(data, column_number = 3):
  # initialize dictionary
  words_dictionary = {}
  # loop all instances 
  for i in tqdm(range(len(data))):
    # get text from each instances
    text_array  = data.iloc[i, column_number]
    # get words
    for text in text_array.split():
      # if the word doesn't exits in dictionary then set number to 1
      if words_dictionary.get(text) == None:
        words_dictionary[text] = 1
      # if the word alreadt exits, then add number 1
      else:
        words_dictionary[text] = words_dictionary.get(text) + 1 
  return words_dictionary


#-------------words_frequency_report--------------
def words_frequency_report(data, feature = 'reviewText',show_all=False,fig_size = (15,10)):
  # get column number
  column_number = data.columns.get_loc(feature)
  print("Start getting word reports by ", feature)
  words_dictionary = get_words_dictionary(data, column_number)
  print("### Finish get the words report")
  # get report 
  report = pd.DataFrame.from_dict(words_dictionary, orient='index')
  report.columns = ['counts']
  print("Load into Pandas dataFrame")
  # sort report
  report = report.sort_values(by=['counts'],  ascending=False)
  print("Sorting DataFrame")
  # decide if print all
  if show_all:
    report_head = report
  else:    
    # get first 50 columns
    report_head = report.head(50)

  print("Get report's top words\nStart plotting")
  # plot setting
  fig, ax = plt.subplots(figsize = fig_size)
  report_head.plot.barh(ax = ax)

  if show_all:
      ax.set_title("Distribution of " + str(feature))
  else:  
    ax.set_title("Most frequent 50 words' distribution of " + str(feature))
  ax.set_xlabel('Counts')
  ax.set_ylabel('Words')
  plt.gca().invert_yaxis()
  return report

from sklearn.feature_selection import SelectKBest

# ------------------feature selection-------------------------
def select_features_prompt(X_train, y_train, X_test,function):
    # configure to select all features
    fs = SelectKBest(score_func=function, k='all')
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    # what are scores for the features

    # print features' name and score
    for i in range(len(fs.scores_)):
        print(f'Feature {i}  {features_name[i]}: { fs.scores_[i]}' )
    return fs.scores_

#---------words_importance_plot---------------------
def words_importance_plot(results, fig_size = (15,10)):
  fig, ax = plt.subplots(figsize = fig_size)
  results.plot.barh(ax=ax)
  plt.gca().invert_yaxis()
  ax.set_ylabel('Importance')
  ax.set_title("Barplot of words' importance")


#--------------text_item_properties---------------#
'''We want to save all the results to a new dataframe'''
def text_item_properties(data):
  result = pd.DataFrame()
  data = data.copy()
  # we just fill na with 0 here. Without doing so, there will be an error
  data = data.fillna('0')
  for i in range(len(data.columns)):
    # get character length
    result['Text_length'] = data[str(data.columns[i])].str.len() 
    # get number of words
    result['num_of_words'] = data[str(data.columns[i])].str.split().str.len()
    # get non alphanumeric number
    result['presence_non_alphanumeric'] = data[str(data.columns[i])].str.replace('[a-zA-Z0-9 ]', '').str.len()
    # get stop words account
    result['stop_words_count'] = data[str(data.columns[i])].str.split().apply(lambda x: len(set(x) & stop_words)) 
  return result 
#---------------clean_useless_information---------------

def clean_useless_information(data_df, columns = ['reviewText']):
  data = data_df.copy()
  for i in range(len(columns)):
    # clean html tag
    data[columns[i]] = data[(columns[i])].str.replace('<[^<]+?>', '')
    # clean &nbsp
    data[(columns[i])] = data[(columns[i])].str.replace('&nbsp', '')
    # clean http URL
    data[(columns[i])] = data[(columns[i])].str.replace('http\S+', '')
    # clean line breaker
    data[(columns[i])] = data[(columns[i])].str.replace('\n', '')
    return data
    
#----------------TDIDF_Data_generator---------------
def TDIDF_Data_generator(data, max_features = 500):
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.feature_extraction.text import TfidfTransformer
  # we set for max_features: as most words we saved for 500 words
  # in order to not exceed our memory
  v_test = TfidfVectorizer(stop_words='english', max_features = max_features)
  v_style = TfidfVectorizer(stop_words='english', max_features = 8)
  # get TDIDF array from token_text
  x_token_text = v_test.fit_transform(data['token_text'])
  # save it into a pandas dataframe
  tdidf_data = pd.DataFrame(x_token_text.toarray())
  # get TDIDF array from style
  x_style = v_style.fit_transform(data['style'])
  tdidf_data = pd.concat([tdidf_data, pd.DataFrame(x_style.toarray())], axis = 1)
  tdidf_data['verified'] = data['verified']
  tdidf_data['score'] = data['score']
  return tdidf_data
  

#----------------TDIDF_Data_generator_pos---------------
def TDIDF_Data_generator_pos(data, max_features = 1000, feature_name='only_noun'):
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.feature_extraction.text import TfidfTransformer
  # we set for max_features: as most words we saved for 500 words
  # in order to not exceed our memory
  v_test = TfidfVectorizer(stop_words='english', max_features = max_features)
  # get TDIDF array from token_text
  x_token_text = v_test.fit_transform(data[feature_name])
  # save it into a pandas dataframe
  tdidf_data = pd.DataFrame(x_token_text.toarray(), columns = v_test.get_feature_names() )

  data_copy = data.copy()
  data_copy = data_copy.reset_index() # need to reset index for matching the results

  tdidf_data['verified'] = data_copy['verified']
  tdidf_data['score'] = data_copy['score']
  return tdidf_data


#-------------find_outliers-----------------
def find_outliers(data_df, parameter,* , drop=False, set_threshold=False, threshold_value = 350): # deal with outliers    
    '''detect and delete outliers '''
    # same with previous find_outliers function
    Q1 = data_df[parameter].quantile(0.25)
    Q3 = data_df[parameter].quantile(0.75)
    IQR = Q3-Q1
    
    print(f"IQR = {Q3} - {Q1} = {IQR}")
    print(f"MAX = {(Q3 + 1.5 * IQR)}")
    
    if Q1 > 1.5*IQR :
        print("Min: ", (Q1 - 1.5 * IQR))
    else:
        print("Min is 0")

    cut_out_value =  (Q3 + 1.5 * IQR) # normal outliers deleted
    # override the value if we set threshold
    if set_threshold == True:
        cut_out_value = threshold_value
    
    # get min outliers' index 
    # get max outliers' index
    min_outliers_df = data_df[(data_df[parameter] < (Q1 - 1.5 * IQR))]
    max_outliers_df = data_df[(data_df[parameter] > cut_out_value)]
    # get negtive outliers' index  
    negative_outliers_df = data_df[(data_df[parameter] <= 0)]         
    print("Num of min outliers: ", len(min_outliers_df))
    print("Num of max outliers: ", len(max_outliers_df))
    print("Num of negative outliers: ", len(negative_outliers_df))
    print("Num of the original data set's whole instance", len(data_df))
    print("Rate of purged data/total data", len(max_outliers_df)/ len(data_df))

    # It's pretty hard to drop multiple indexes at the same time
    # Because after one drop action, their index are changed from then
    # We need to alter the order of aboving codes. 
    # And it's pretty unnecessary for us to do this in our assignemnt
    # Since we don't have min outliers in this dataset
    # And negative values are not outliers
    # I decided to purge negative values in transformer instead of here
    return max_outliers_df.index


#---------------clean_useless_information---------------
def show_purged_reports(data_df, parameter = ['reviewText'], output_type = 'num_of_words'):
  data = data_df.copy() # get the copy
  # get our reports
  reports = text_item_properties( data.loc[:, parameter]);
  # find outliers
  index = find_outliers(reports, output_type);
  # plot the results
  ax = mulitple_function_plots(data=reports.drop(index), kde_type = False, plot_type="histogram",data_type="number", fig_size=(15,7),tight_layout=False)
  ax = mulitple_function_plots(data=reports.drop(index), kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7) , tight_layout=False);
  return reports, index

#------------------------learning_curve------------------
def learning_curve(N, train_lc, val_lc):
  # set the figure size
  fig, ax = plt.subplots(figsize=(16, 6))
  # get the training score
  ax.plot(N, np.mean(train_lc, 1), color='blue', label='training score')
  # get the validation score
  ax.plot(N, np.mean(val_lc, 1), color='red', label='validation score')
  # draw the grid line
  ax.hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1],
                color='gray', linestyle='dashed')
  # graph setting up
  ax.set_ylim(0.5, 1.2)
  ax.set_xlim(N[0], N[-1])
  ax.set_xlabel('training size')
  ax.set_ylabel('Accuracy')
  ax.set_title("Random forest Accuracy Train/Valid of our final model")
  ax.legend(loc='best')
  fig.show()

#------------------------valid_score_curve------------------
def valid_score_curve(train_score, val_score, n_estimators = np.arange(1, 50)):
  fig, ax = plt.subplots(figsize=(16, 6))
  # get mean of 5 cv of values
  ax.plot(n_estimators, np.median(train_score, 1), color='blue', label='training score')
  ax.plot(n_estimators, np.median(val_score, 1), color='red', label='validation score')
  # matplot setting
  ax.legend(loc='best')
  ax.set_ylim(0.1, 1.2)
  ax.set_xlim(0, 50)
  ax.set_title("Train/Valid ACCURACY loss of different random forest models")
  ax.set_xlabel('number of trees')
  ax.set_ylabel('ACCURACY');
  plt.show()

#---------------clean_useless_information---------------
def show_reports(data_df, parameter = ['reviewText'], output_type = 'num_of_words'):
  data = data_df.copy() # get the copy
  # get our reports
  reports = text_item_properties( data.loc[:, parameter]);
  # plot the results
  ax = mulitple_function_plots(data=reports, kde_type = False, plot_type="histogram",data_type="number", fig_size=(15,7),tight_layout=False)
  ax = mulitple_function_plots(data=reports, kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7) , tight_layout=False);
  return reports



################################  FUNCTION  ################################


    

#--------------------------Describe columns----------------------------------------

def describe_columns(data, features_name=[]):
    '''This function will help u print out features value counts'''
    if len(features_name) > 1:
        for i in range(len(features_name)):
            print("----------", data[features_name[i]].name,"---------")
            print(data[features_name[i]].value_counts())
    else:
        print("----------", data[features_name[0]].name,"---------")
        print(data[features_name[0]].value_counts())
        

#-------------Function from tutorial 2-----------------------------

def build_continuous_features_report(data_df):
    
    """Build tabular report for continuous features"""

    stats = {
        "Count": len,
        "Miss %": lambda df: df.isna().sum() / len(df) * 100,
        "Card.": lambda df: df.nunique(),
        "Min": lambda df: df.min(),
        "1st Qrt.": lambda df: df.quantile(0.25),
        "Mean": lambda df: df.mean(),
        "Median": lambda df: df.median(),
        "3rd Qrt": lambda df: df.quantile(0.75),
        "Max": lambda df: df.max(),
        "Std. Dev.": lambda df: df.std(),
    }

    contin_feat_names = data_df.select_dtypes("number").columns
    continuous_data_df = data_df[contin_feat_names]

    report_df = pd.DataFrame(index=contin_feat_names, columns=stats.keys())

    for stat_name, fn in stats.items():
        # NOTE: ignore warnings for empty features
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=RuntimeWarning)
            report_df[stat_name] = fn(continuous_data_df)

    return report_df
    
#-------------Function from tutorial 2---------------------------

def build_categorical_features_report(data_df):

    """Build tabular report for categorical features"""

    def _mode(df):
        return df.apply(lambda ft: ft.mode().to_list()).T

    def _mode_freq(df):
        return df.apply(lambda ft: ft.value_counts()[ft.mode()].sum())

    def _second_mode(df):
        return df.apply(lambda ft: ft[~ft.isin(ft.mode())].mode().to_list())

    def _second_mode_freq(df):
        return df.apply(
            lambda ft: ft[~ft.isin(ft.mode())]
            .value_counts()[ft[~ft.isin(ft.mode())].mode()]
            .sum()
        )

    stats = {
        "Count": len,
        "Miss %": lambda df: df.isna().sum() / len(df) * 100,
        "Card.": lambda df: df.nunique(),
        "Mode": _mode,
        "Mode Freq": _mode_freq,
        "Mode %": lambda df: _mode_freq(df) / len(df) * 100,
        "2nd Mode": _second_mode,
        "2nd Mode Freq": _second_mode_freq,
        "2nd Mode %": lambda df: _second_mode_freq(df) / len(df) * 100,
    }

    cat_feat_names = data_df.select_dtypes(exclude="number").columns
    continuous_data_df = data_df[cat_feat_names]

    report_df = pd.DataFrame(index=cat_feat_names, columns=stats.keys())

    for stat_name, fn in stats.items():
        # NOTE: ignore warnings for empty features
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=RuntimeWarning)
            report_df[stat_name] = fn(continuous_data_df)

    return report_df


#-------------One function to plot multiple kinds of graph ---------------------------

# All codes were written by myself

# set keyword only parameter. Since we have 3 options to plot. plot function and data type must be indentified before ploting

def mulitple_function_plots(tight_layout = True, h_space = 0.4,w_space=0.3, columns = 2,  fig_size = (10,15) , kde_type = True,
                            *,data,plot_type="histogram",data_type="number"):
    
    '''Plot all features from the dataset, you must specified your dataset by, data = '''
    if data_type == "number":
        feat_names = data.select_dtypes("number").columns 
    elif data_type == "categorical":
        feat_names = data.select_dtypes(exclude="number").columns
        
    # seperate those features into 2 columns
    rows_number = math.ceil(len(feat_names)/columns)
    
    print("Those features will be plotted in ", rows_number, " rows and ", columns , "columns")
    # print continuous features name
    print(feat_names)
    
    #initialize figure
    fig, axs = plt.subplots(rows_number, columns, figsize=fig_size)
    index = 0
    start = datetime.datetime.now()
    
    #print
    for i in range(rows_number):
        for j in range(columns):
            if index < len(feat_names):
                if plot_type == 'histogram': # shortcut for histogram plot
                    sns.histplot(data=data, x=feat_names[index], bins = 30,kde=kde_type, ax=axs[i][j])
                elif plot_type == 'boxplot': # boxplot
                    data.boxplot(column=feat_names[index],ax=axs[i][j], vert=False)
                elif plot_type == 'barplot': # barplot
                    data[feat_names[index]].value_counts().plot.bar(ax=axs[i][j],rot=0);
                # set corresponded name of selected features
                axs[i][j].set_xlabel(feat_names[index])
                # end of calculating the time
                end = datetime.datetime.now()
                # print info
                print(index+1, ". Finish Rendering :", feat_names[index],", used",  
                      (end - start).seconds, "millseconds")
                index += 1
            else:
                break
    #adjust pictures
    plt.subplots_adjust(hspace = h_space,wspace=w_space)
    # add figure title
    fig.suptitle(str(plot_type.title() + " of all " + data_type.title() + " features"), fontweight ="bold")
    # set whether we want to plot a tight_layout figure
    if tight_layout:
        fig.tight_layout()
        fig.subplots_adjust(top=0.95)
    return axs


#------------- draw heatmap -----------------------------------------------------

def heatmap_draw(data):
    # Correlation between different variables
    corr = data.corr()
    # Set up the matplotlib plot configuration
    f, ax = plt.subplots(figsize=(12, 10))
    # Configure a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)
    # Draw the heatmap
    sns.heatmap(corr, annot=True, cmap=cmap)
    plt.title("Heatmap correlation among all features")


import matplotlib.pyplot as plt
import numpy
from sklearn import metrics



############################FUNCTIONS ON PIPELINE##########################################


#--------------purge_NaN-------------------
def purge_NaN(data_df):
  data = data_df.copy()
  # drop vote and image
  data = data.drop(['vote','image'], axis = 1)
  # drop NaN values
  for i in range(len(data.columns)):
    data = data.drop(data[data[str(data.columns[i])].isna()].index)
  return data



#---------------clean_useless_information---------------
def purge_outliers(data_df, parameter = 'reviewText', output_type = 'num_of_words'):
  data = data_df.copy() # get the copy
  # find outliers
  result = pd.DataFrame()
  result[output_type] = data[parameter].str.split().str.len()
  index = find_outliers(result, output_type);
  # plot the results
  return data.drop(index)


  

################################  CLASS  ################################




#------------- main transformer ---------------------
# Class for attribute transformer
# import important libray
from sklearn.base import BaseEstimator, TransformerMixin

class combined_attribute_adder_and_cleaner(BaseEstimator, TransformerMixin):
    '''data clean transfomer class'''
    
    def __init__(self, data_cleaner = True, servies_remainer = False, normalization = True): # no *args or **kargs
        # we need to set extra var to ensure do we need to purge the dataset. 
        # In my following experments, sometimes we don't need to do so. 
        self.data_cleaner = data_cleaner
        self.servies_remainer = servies_remainer
        self.normalization = normalization

    def fit(self, X, y=None):
        return self # nothing else to do

    def transform(self, data_df):
        # we first copy the data from our dataset.
        # operate on original data set sometimes is dangerous.
        X = data_df.copy()

        #0. drop NaN values
        # drop vote and image
        X = X.drop(['vote','image'], axis = 1)
        # drop NaN values
        for i in range(len(X.columns)):
          X = X.drop(X[X[str(X.columns[i])].isna()].index)

        # 1. First we change the feature verified with to integer
        X["verified"] = X["verified"].astype(int)

        # 2. purge outliers
        X = purge_outliers(X)

        # 3. drop all useless features and categorical features we alreayd transfered
        X = X.drop(['reviewerID','reviewTime', 'asin', 'unixReviewTime'],axis=1) 

        # 4. delete HTML tag and other useless characters
        X = clean_useless_information(X)

        # 5. clean alphanumeric data
        X['style'] = X['style'].str.replace('Format', '')

        # get text feature
        feature = X.select_dtypes(exclude="number").columns

        for i in range(len(feature)):
            print("Now it's removing number and alphanumberic from ", feature[i])
            # remove stop words
            # first change all character to lower case
            X[feature[i]] = X[feature[i]].str.replace('[^\w\s]+', '')
            X[feature[i]] = X[feature[i]].str.replace('[0-9]+', '')

        # remove stop words
        stop_words = stopwords.words('english')

        for i in range(len(feature)):
          print("Now it's removing stop words from ", feature[i])
          # remove stop words
          # first change all character to lower case
          X[feature[i]] = X[feature[i]].str.lower()
          X[feature[i]] = X[feature[i]].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

        # create new column
        X['text'] = X['summary'] + " " + X['reviewText']

        #6. clean style's space
        X['style'] = X['style'].str.replace(' ', '')
        
        # we put our target value at the end
        target = X.pop('overall')
        X['score'] = target


        return X
#############################PIPE LINE###################################################



# Now we build a transformer to get all the above steps
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# convert_pipeline is for create a whole pipeline but remain the dataFrame structure
convert_pipeline = Pipeline([
        ('attribs_adder_cleaner', combined_attribute_adder_and_cleaner(data_cleaner=True)),
    ])











# ensure the random seed, that our result won't be really random
def same_seed(seed):
    '''Fixes random number generator seeds for reproducibility.'''
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    torch.manual_seed(seed)
    # set cuda seed
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
##############
# split dataset
def train_valid_split(data_set, valid_ratio, seed):
    '''Split provided training data into training set and validation set'''
    # split dataset into train, validation set by ratio
    valid_set_size = int(valid_ratio * len(data_set))
    # get the rest of the set
    train_set_size = len(data_set) - valid_set_size
    # random split the set by what we defined before
    train_set, valid_set = random_split(data_set, [train_set_size, valid_set_size], generator=torch.Generator().manual_seed(seed))
    # return the np array for training
    return np.array(train_set), np.array(valid_set)

#### function of make the neural network prediction
def predict(test_loader, model, device):
    model.eval() # Set your model to evaluation mode.
    preds = []
    for x in tqdm(test_loader): # use tqdm to show the progress
        x = x.to(device)
        with torch.no_grad(): # before do prediction, we need to turn off gradient decent
            pred = model(x)
            preds.append(pred.detach().cpu()) # no need to use GPU for prediction
    preds = torch.cat(preds, dim=0).numpy() # concatenate the results
    return preds


def plot_learning_curve(loss_record, title='', type = 'acc' ,  y_start = 0., y_end = 1, ylabel='Accuracy', figsize = (17,10), x_start = 0, x_end = 2000):
    ''' Plot learning curve of your DNN (train & dev loss) '''

    accuracy_label = ['train_acc', 'valid_acc']
    loss_label = ['train_loss', 'valid_loss']

    if type == 'acc':
      plot_selection = accuracy_label
    else:
      plot_selection = loss_label

    x_end =  len(loss_record[plot_selection[0]])
    total_steps = len(loss_record[plot_selection[0]]) # get the length of our records
    x_1 = range(total_steps) # get the range of x
    x_2 = x_1[::len(loss_record[plot_selection[0]]) // len(loss_record['valid_acc'])]
    figure(figsize=figsize) # set figsize
    plt.plot(x_1, loss_record[plot_selection[0]], c='tab:red', label='train loss')
    plt.plot(x_2, loss_record[plot_selection[1]], c='tab:cyan', label='validation loss')
    plt.ylim(y_start, y_end) # set limit on x axis
    plt.xlim(x_start, x_end) # set limit on y axis
    plt.xlabel('Training steps')
    plt.ylabel(ylabel)
    plt.title('Learning curve of {}'.format(title))
    plt.legend()
    plt.show()

def plot_pred(dv_set, model, device, lim=360., preds=None, targets=None,figsize=(15,15)):
    ''' Plot prediction of your DNN '''
    if preds is None or targets is None:
        model.eval()
        preds, targets = [], [] # do prediction
        for x, y in dv_set:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                preds.append(pred.detach().cpu())
                targets.append(y.detach().cpu())
        preds = torch.cat(preds, dim=0).numpy() # save the prediction
        targets = torch.cat(targets, dim=0).numpy() # save the target value
    # matplot setting
    figure(figsize = figsize)
    plt.scatter(targets, preds, c='r', alpha=0.5)
    plt.plot([-0.2, lim], [-0.2, lim], c='b')
    plt.xlim(-0.2, lim)
    plt.ylim(-0.2, lim)
    plt.xlabel('ground truth value')
    plt.ylabel('predicted value')
    plt.title('Ground Truth v.s. Prediction')
    plt.show()





# Dataset class
class Dataset_container(Dataset):
    '''
    x: Features.
    y: Targets, if none, do prediction.
    '''
    def __init__(self, x, y=None):
        if y is None:
            self.y = y # return prediction
        else:
            self.y = torch.LongTensor(y) # get target value
        self.x = torch.FloatTensor(x) # get features

    def __getitem__(self, idx):
        if self.y is None:
            return self.x[idx] # get features
        else:
            return self.x[idx], self.y[idx] # get features and target

    def __len__(self):
        return len(self.x) # return length


def select_feat(train_data, valid_data, test_data, select_all=True):
    '''Selects useful features to perform regression'''
    # because we are operating on np arrays, we will alter the dataset and assume the last column is our target
    y_train, y_valid, y_test= train_data[:,-1], valid_data[:,-1],test_data[:,-1]
    # any columns before the last column is our features x
    raw_x_train, raw_x_valid, raw_x_test = train_data[:,:-1], valid_data[:,:-1], test_data[:,:-1]
    # Hyperparameter setting
    # select all for selecting all features
    if select_all:
        feat_idx = list(range(raw_x_train.shape[1]))
    else:
        # specific the feature index
        feat_idx = [0,1,2,3,4]
    #return the datasets
    return raw_x_train[:,feat_idx], raw_x_valid[:,feat_idx], raw_x_test[:,feat_idx], y_train, y_valid, y_test



device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = {
    'seed': 5201314,      # Your seed number, you can pick your lucky number. :)
    'select_all': True,   # Whether to use all features.
    'valid_ratio': 0.42857143,   # validation_size = train_size * valid_ratio
    'n_epochs': 2000,     # Number of epochs. Try 2000 at first
    'batch_size': 512,   # since we have a quite large dataset. The batch size should be large too
    'learning_rate': 1e-3,
    'early_stop': 50,    # If model has not improved for this many consecutive epochs, stop training.
    'save_path': './models/model.ckpt'  # Your model will be saved here.
}


    # prepara for training data oversampling

from sklearn.datasets import make_classification

from imblearn.over_sampling import RandomOverSampler
from collections import Counter


def create_data_loader(model_train, model_test, oversampling = True):
  # Set seed for reproducibility
  same_seed(config['seed'])

  test_data = model_test.values
  # get training data
  train_data = model_train

  ###################################################################################
  ##################Oversampling on training data only###############################

  if oversampling == True:

      # split to trianing set and testing set
    train_data, valid_data = train_test_split(train_data, test_size = config['valid_ratio'], random_state = config['seed'])  
    print(f'original train_data with out oversampling size: {train_data.shape}')
    ros = RandomOverSampler(random_state=0) # RandomOverSampler
    X_resampled, y_resampled = ros.fit_resample(train_data.iloc[:,:-1], train_data.iloc[:,-1:]) # get the features and labels
    # get oversampled train data
    train_data = pd.concat([X_resampled,y_resampled], axis = 1) # concate features and labels
    
  else:
    train_data, valid_data = train_test_split(train_data, test_size = config['valid_ratio'], random_state = config['seed'])  


  ##################End of Oversampling on training data#############################
  ###################################################################################
  train_data = train_data.astype(float).values
  valid_data = valid_data.astype(float).values
  test_data = test_data.astype(float)
  # fool proof for unsuccessful data preparation stage.

  # Print out the data size.
  print(f"""train_data size: {train_data.shape}
  valid_data size: {valid_data.shape}
  test_data size: {test_data.shape}""")

  # Select features
  x_train, x_valid, x_test, y_train, y_valid, y_test = select_feat(train_data, valid_data, test_data, config['select_all'])

  # Print out the number of features.
  print(f'number of features: {x_train.shape[1]}')

  train_dataset, valid_dataset, test_dataset = Dataset_container(x_train, y_train), \
                                              Dataset_container(x_valid, y_valid), \
                                              Dataset_container(x_test,y_test)

  # Pytorch data loader loads pytorch dataset into batches.
  train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
  valid_loader = DataLoader(valid_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
  test_loader = DataLoader(test_dataset, batch_size=config['batch_size'], shuffle=False, pin_memory=True)
  return train_loader, valid_loader, test_loader




#------------------------learning_curve------------------
def learning_curve(N, train_lc, val_lc):
  # set the figure size
  fig, ax = plt.subplots(figsize=(16, 6))
  # get the training score
  ax.plot(N, np.mean(train_lc, 1), color='blue', label='training score')
  # get the validation score
  ax.plot(N, np.mean(val_lc, 1), color='red', label='validation score')
  # draw the grid line
  ax.hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1],
                color='gray', linestyle='dashed')
  # graph setting up
  ax.set_ylim(0.5, 1.2)
  ax.set_xlim(N[0], N[-1])
  ax.set_xlabel('training size')
  ax.set_ylabel('Accuracy')
  ax.set_title("Random forest Accuracy Train/Valid of our final model")
  ax.legend(loc='best')
  fig.show()


#------------------------valid_score_curve------------------
def valid_score_curve(train_score, val_score, n_estimators = np.arange(1, 50)):
  fig, ax = plt.subplots(figsize=(16, 6))
  # get mean of 5 cv of values
  ax.plot(n_estimators, np.median(train_score, 1), color='blue', label='training score')
  ax.plot(n_estimators, np.median(val_score, 1), color='red', label='validation score')
  # matplot setting
  ax.legend(loc='best')
  ax.set_ylim(0.6, 1.2)
  ax.set_xlim(0, 50)
  ax.set_title("Train/Valid ACCURACY loss of different random forest models")
  ax.set_xlabel('number of trees')
  ax.set_ylabel('ACCURACY');
  plt.show()



import matplotlib.pyplot as plt
import numpy
from sklearn import metrics

from sklearn.preprocessing import normalize
from sklearn.metrics import confusion_matrix


#------------------------draw_confusion_matrix------------------
def draw_confusion_matrix_testing(y_test, pred):
  # calculate the confusion matrix
  confusion_matrix = metrics.confusion_matrix(y_test, pred)
  cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = [True, False])
  cm_display.plot()
  # matplot setting
  cm_display.ax_.set_xlabel("Prediciton")
  cm_display.ax_.set_ylabel("Churn")
  cm_display.ax_.set_title("Confusion matrix of Churn vs prediciton")
  plt.show()

#------------------------draw_normalized_confusion_matrix------------------
def draw_normalized_confusion_matrix_testing(y_true, y_pred, fig_size = (10,7)):

  #initialize figure
  fig, axs = plt.subplots(figsize=fig_size)
  x = confusion_matrix(y_true, y_pred)
  x_normed = normalize(x, axis=1, norm='l1')
  sns.heatmap(x_normed, annot=True, fmt='g', ax = axs);  #annot=True to annotate cells, ftm='g' to disable scientific notation
  # labels, title and ticks
  axs.set_xlabel("Prediction");axs.set_ylabel('Churn');
  axs.set_title('Confusion Matrix of Churn vs Prediction');
  axs.xaxis.set_ticklabels(['True', 'False']); axs.yaxis.set_ticklabels(['True', 'False']);
  fig.tight_layout()
  fig.subplots_adjust(top=0.95)


#-------------------test_scores_on_two_models-------------------

def test_scores_on_two_models(test_loader, pred_1, pred_2, function, type = 'Accuracy'):
  RF_score = []
  NN_score = []
  for i in range(10):
    RF_score.append(function(test_loader[i].dataset.y.numpy(), pred_1[i]))
    NN_score.append(function(test_loader[i].dataset.y.numpy(), pred_2[i]))

  results = pd.DataFrame()
  results['randomForest'] = RF_score
  results['NeuralNetwork'] = NN_score

  figure(figsize=(8, 6), dpi=100)
  ax = sns.boxplot(data=results)
  plt.ylabel(str(type))
  plt.title('Boxplot of '+ str(type)+ 'between Randomforest and DNN')
  ks_2samp_test(results, 'randomForest', 'NeuralNetwork')
  return RF_score, NN_score



from scipy.stats import ks_2samp
#-------------------ks_2samp_test-------------------
def ks_2samp_test(data, param1='randomForest', param2='NeuralNetwork'):
	max_print_out(True)
	value, pvalue = ks_2samp(data[param1].values,data[param2].values)
	print("##################### p-value = ", pvalue, "####################")
	if pvalue > 0.05:
		print('Samples are likely drawn from the same distributions (fail to reject H0)')
	else:
		print('##################### Samples are likely drawn from different distributions (reject H0)####################')

1. Task 1: Data understanding (0.1)

Start up

  1. Load data and create datasets.
  2. Build the data quality report.

  3. Identify data quality issues and build the data quality plan.

  4. Preprocess your data according to the data quality plan.

  5. Answer the following questions:

    1. What is the distribution of the top 50 most frequent words (excluding the stop words) for each of the textual features?
    2. What is the proportion of each format in the dataset?
    3. What is the most/least common format of the books?
    4. What patterns can you find in your data? E.g., if you look at the counts for each overall score, people tend to give more positive reviews than negatives. (you are encouraged to find different patterns to the one proposed here as an example)

1.0 Load Original Data and create a 1 million subset (Bonus Mark task)

Since the original dataset is too larger, there is no FREE resources we can use to load the whole original dataset into memory.

Hence, we use pandas with chunk size == 2.5 millions to get a thrid chunk of the original dataset then ramdomly sample it into 1 million subset.

Why 2.5 million?

Well, my device and Kaggle notebook can not afford anymore instances in their memory.

Write a few codes to get our 2.5 millions instances from the middle of our orignial dataset.

In [ ]:
# create a container for our subsets
subset =[]
# separate the whole dataset by chunks = 2,500,000 (2.5 millions subsets)
# we will randomly sample 1 million subsets from it
with pd.read_json('./Books_5.json.gz', lines=True, chunksize=2500000) as reader: 
    index = 0
    for subset in reader: # get subsets by reader
        print("Now, it's loading chunk ", index)
        # we don't take the first 2.5 millions dataset
        index += 1 # set index for skipping the first a few millions instances
        if index == 3: # we take the third 2.5 million instances as our subset
            break
        del subset # clean memory manually
Now, it's loading chunk  0
Now, it's loading chunk  1
Now, it's loading chunk  2

Print the subset's head

In [ ]:
subset.head()
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime vote image
5000000 3 True 10 22, 2012 A29L3RMRY5GDPO 0356500586 {'Format:': ' Kindle Edition'} Emily H. This seems like the author was inspired by Tru... Good But A Bit Too Much Like True Blood 1350864000 NaN NaN
5000001 5 True 10 12, 2012 ARQR72CBQYCJ9 0356500586 {'Format:': ' Kindle Edition'} Kindle Customer This is an auto mechanic like none I have ever... Moon Magic 1350000000 NaN NaN
5000002 5 True 02 28, 2017 A3GWE80SUGORJD 0373004559 {'Format:': ' Kindle Edition'} Bette Hansen When I found out there was a Kowalski Family r... I'm happy to say that Shannon Stacey did not d... 1488240000 NaN NaN
5000003 4 False 02 28, 2017 A23M3HDG0IWHWB 0373004559 {'Format:': ' Kindle Edition'} Bobbie This was a fun quick read about a family ( the... Fun and Enjoyable 1488240000 NaN NaN
5000004 4 False 02 28, 2017 A2NF7W3NOVHO5O 0373004559 {'Format:': ' Kindle Edition'} Lisa M. I was so excited when I found out that there w... Entertaining! 1488240000 NaN NaN

Convert raw dataset dtypes and print the info

In [ ]:
subset = subset.convert_dtypes()
subset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500000 entries, 5000000 to 7499999
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   overall         Int64  
 1   verified        boolean
 2   reviewTime      string 
 3   reviewerID      string 
 4   asin            string 
 5   style           object 
 6   reviewerName    string 
 7   reviewText      string 
 8   summary         string 
 9   unixReviewTime  Int64  
 10  vote            string 
 11  image           object 
dtypes: Int64(2), boolean(1), object(2), string(7)
memory usage: 219.3+ MB

We can see that style is an object.

Before we do anything to it. We create a copy feature of it and convert it to string type

In [ ]:
subset['style_str'] = subset['style'].astype('str') 
# show value counts
subset.style_str.value_counts()
Out[ ]:
{'Format:': ' Kindle Edition'}           1213279
{'Format:': ' Paperback'}                 514457
{'Format:': ' Hardcover'}                 454139
{'Format:': ' Mass Market Paperback'}     238562
{'Format:': ' Board book'}                 23113
                                          ...   
{'Format:': ' Video Game'}                     1
{'Format:': ' Single Issue Magazine'}          1
{'Format:': ' DVD Audio'}                      1
{'Color:': ' gold'}                            1
{'Format:': ' Bath Book'}                      1
Name: style_str, Length: 79, dtype: int64

It's indeed a string type.

In [ ]:
subset.describe()
Out[ ]:
overall verified unixReviewTime
count 2.500000e+06 2500000 2.500000e+06
unique NaN 2 NaN
top NaN True NaN
freq NaN 1674784 NaN
mean 4.324232e+00 NaN 1.376941e+09
std 1.030883e+00 NaN 1.182797e+08
min 1.000000e+00 NaN 8.356608e+08
25% 4.000000e+00 NaN 1.356048e+09
50% 5.000000e+00 NaN 1.407974e+09
75% 5.000000e+00 NaN 1.451693e+09
max 5.000000e+00 NaN 1.525478e+09

Now, let's use random split to get 1 millioni datasets from our 2.5 million dataset.

In [ ]:
from sklearn.model_selection import train_test_split
final_subset, _ = train_test_split(subset, test_size=0.6, random_state=42)

Print subset's info

In [ ]:
final_subset.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 6746618 to 7219110
Data columns (total 13 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   overall         1000000 non-null  Int64  
 1   verified        1000000 non-null  boolean
 2   reviewTime      1000000 non-null  string 
 3   reviewerID      1000000 non-null  string 
 4   asin            1000000 non-null  string 
 5   style           994508 non-null   object 
 6   reviewerName    999978 non-null   string 
 7   reviewText      999868 non-null   string 
 8   summary         999862 non-null   string 
 9   unixReviewTime  1000000 non-null  Int64  
 10  vote            217191 non-null   string 
 11  image           1559 non-null     object 
 12  style_str       1000000 non-null  object 
dtypes: Int64(2), boolean(1), object(3), string(7)
memory usage: 103.0+ MB

Now, we have our self-generated data.

Let's make our dataset as same as our kaggle subset's structure.

In [ ]:
raw_data = final_subset
raw_data = raw_data.convert_dtypes()
raw_data = raw_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it

Print our new generated data columns for more details.

In [ ]:
raw_data.columns
Out[ ]:
Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote',
       'image', 'style_str'],
      dtype='object')
In [ ]:
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 13 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   overall         1000000 non-null  Int64  
 1   verified        1000000 non-null  boolean
 2   reviewTime      1000000 non-null  string 
 3   reviewerID      1000000 non-null  string 
 4   asin            1000000 non-null  string 
 5   style           994508 non-null   string 
 6   reviewerName    999938 non-null   string 
 7   reviewText      999867 non-null   string 
 8   summary         999859 non-null   string 
 9   unixReviewTime  1000000 non-null  Int64  
 10  vote            217191 non-null   object 
 11  image           1559 non-null     string 
 12  style_str       994508 non-null   string 
dtypes: Int64(2), boolean(1), object(1), string(9)
memory usage: 95.4+ MB

we can see that vote only has 217191 instances, there are too many NaN values, we don't convert it type for now. SInce we will drop this column later.

In [ ]:
# we don't need this self generated column anymore
raw_data = raw_data.drop('style_str',axis = 1)
In [ ]:
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 12 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   overall         1000000 non-null  Int64  
 1   verified        1000000 non-null  boolean
 2   reviewTime      1000000 non-null  string 
 3   reviewerID      1000000 non-null  string 
 4   asin            1000000 non-null  string 
 5   style           994508 non-null   string 
 6   reviewerName    999938 non-null   string 
 7   reviewText      999867 non-null   string 
 8   summary         999859 non-null   string 
 9   unixReviewTime  1000000 non-null  Int64  
 10  vote            217191 non-null   object 
 11  image           1559 non-null     string 
dtypes: Int64(2), boolean(1), object(1), string(8)
memory usage: 87.7+ MB
In [ ]:
# save the final version of data
raw_data.to_csv('self_generated_dataset.csv')

ReLoad 1 million self generated subset for convenient

I compared my 1 million dataset with kaggle dataset personally. The result is fine. I won't attach those here. Since we can see in the describe() function that they are similary distribtution of overall scores.

In [ ]:
raw_data = pd.read_csv('/content/drive/MyDrive/A3/self_generated_dataset.csv')
raw_data = raw_data.convert_dtypes()
raw_data = raw_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
raw_data.head()
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882: DtypeWarning: Columns (11) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime vote image
0 2 False 12 4, 2015 A273QRPDN6IQC8 0446676101 {'Format:': ' Kindle Edition'} Rub Chicken Book starts out with some really interesting i... ... with some really interesting ideas and get... 1449187200 NaN <NA>
1 5 True 04 24, 2016 A31Q39MDPVBTSX 0451473019 {'Format:': ' Hardcover'} BluegrassAnne was a gift for someone. he loved it he loved 1461456000 NaN <NA>
2 4 True 10 30, 2014 A353XVWAOOUCQS 0385352107 {'Format:': ' Kindle Edition'} Reader in the Pacific I have read a number of Murakami novels and th... Hollow Man 1414627200 NaN <NA>
3 5 True 10 9, 2013 A1DPPR0FYZ8B4A 042525609X {'Format:': ' Paperback'} Wonder Woman This book is a stand alone read as some have m... Letting go of demons...to embrace deserved love. 1381276800 NaN <NA>
4 5 False 08 15, 2015 A2XS90TMQ26YYH 0399146695 {'Format:': ' Kindle Edition'} J. Williams Great story! Highly recommended! Five Stars 1439596800 NaN <NA>

1.1 Build the data quality report (0.1)

Print our self-generated data's info

In [ ]:
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 12 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   overall         1000000 non-null  Int64  
 1   verified        1000000 non-null  boolean
 2   reviewTime      1000000 non-null  string 
 3   reviewerID      1000000 non-null  string 
 4   asin            1000000 non-null  string 
 5   style           994508 non-null   string 
 6   reviewerName    999938 non-null   string 
 7   reviewText      999867 non-null   string 
 8   summary         999859 non-null   string 
 9   unixReviewTime  1000000 non-null  Int64  
 10  vote            217191 non-null   object 
 11  image           1559 non-null     string 
dtypes: Int64(2), boolean(1), object(1), string(8)
memory usage: 87.7+ MB

We can see there are mulitple features have missing value. Vote and image has too many NaN values, we have to drop those 2 columns. The rest are style, reviewerName, reviewText, summary. There aren't too many of NaN values, we can handle it one by one.

We will handle those later.

In [ ]:
# We set our print out line limis to maximum and set string out print format [.2f ]
max_print_out(True)
# describe continous features summary
raw_data.describe()
Out[ ]:
overall verified unixReviewTime
count 1000000.00 1000000 1000000.00
unique NaN 2 NaN
top NaN True NaN
freq NaN 670120 NaN
mean 4.32 NaN 1376980158.07
std 1.03 NaN 118187110.14
min 1.00 NaN 849225600.00
25% 4.00 NaN 1356220800.00
50% 5.00 NaN 1408060800.00
75% 5.00 NaN 1451692800.00
max 5.00 NaN 1525478400.00

1.1.1 Continuous features report

Reuse the function from Tutorial

In [ ]:
#-------------Function from tutorial 2-----------------------------

def build_continuous_features_report(data_df):
    
    """Build tabular report for continuous features"""

    stats = {
        "Count": len,
        "Miss %": lambda df: df.isna().sum() / len(df) * 100,
        "Card.": lambda df: df.nunique(),
        "Min": lambda df: df.min(),
        "1st Qrt.": lambda df: df.quantile(0.25),
        "Mean": lambda df: df.mean(),
        "Median": lambda df: df.median(),
        "3rd Qrt": lambda df: df.quantile(0.75),
        "Max": lambda df: df.max(),
        "Std. Dev.": lambda df: df.std(),
    }

    contin_feat_names = data_df.select_dtypes("number").columns
    continuous_data_df = data_df[contin_feat_names]

    report_df = pd.DataFrame(index=contin_feat_names, columns=stats.keys())

    for stat_name, fn in stats.items():
        # NOTE: ignore warnings for empty features
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=RuntimeWarning)
            report_df[stat_name] = fn(continuous_data_df)

    return report_df
    
In [ ]:
max_print_out(True)
# Call function from 
build_continuous_features_report(raw_data)
Out[ ]:
Count Miss % Card. Min 1st Qrt. Mean Median 3rd Qrt Max Std. Dev.
overall 1000000 0.00 5 1 4 4.32 5.00 5 5 1.03
verified 1000000 0.00 2 False 0 0.67 1.00 1 True 0.47
unixReviewTime 1000000 0.00 7531 849225600 1356220800 1376980158.07 1408060800.00 1451692800 1525478400 118187110.14

We can see that th cardinality of overall = 5, that is a categorical feature. Verified is also a categorical feature. unixReviewTime is acutally a time, we can convert it to time series to see what does it mean.

We draw the histogram of those 3 features anyway.

In [ ]:
ax = mulitple_function_plots(data=raw_data.loc[:,['overall','verified' , 'unixReviewTime']], kde_type= False , plot_type="histogram",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['overall', 'verified', 'unixReviewTime'], dtype='object')
1 . Finish Rendering : overall , used 0 millseconds
2 . Finish Rendering : verified , used 1 millseconds
3 . Finish Rendering : unixReviewTime , used 1 millseconds

We can see there are a lot of scores are 5. That is not a good thing.

We may need to use stratified sampling later.

In [ ]:
ax = mulitple_function_plots(data=raw_data.loc[:,['overall','verified' , 'unixReviewTime']], kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['overall', 'verified', 'unixReviewTime'], dtype='object')
1 . Finish Rendering : overall , used 1 millseconds
2 . Finish Rendering : verified , used 3 millseconds
3 . Finish Rendering : unixReviewTime , used 7 millseconds

Outliers in unixReviewTime could just be very old reviews. It doesn't matter too much.

There are a few outliers in overall. We need to investigate it further.

1.1.2 Text item properties

we will substitute the actual text items with their properties such as:

  1. Text length (i.e., the number of characters).
  2. The number of words.
  3. Presence of non-alphanumeric characters

  1. Any additional properties that you find useful in understanding text.

    Here we use a property called:

    Stop words count. We want to calculate how many stop words it has in every instance.

In [ ]:
raw_data.head()
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime vote image
0 2 False 12 4, 2015 A273QRPDN6IQC8 0446676101 {'Format:': ' Kindle Edition'} Rub Chicken Book starts out with some really interesting i... ... with some really interesting ideas and get... 1449187200 NaN <NA>
1 5 True 04 24, 2016 A31Q39MDPVBTSX 0451473019 {'Format:': ' Hardcover'} BluegrassAnne was a gift for someone. he loved it he loved 1461456000 NaN <NA>
2 4 True 10 30, 2014 A353XVWAOOUCQS 0385352107 {'Format:': ' Kindle Edition'} Reader in the Pacific I have read a number of Murakami novels and th... Hollow Man 1414627200 NaN <NA>
3 5 True 10 9, 2013 A1DPPR0FYZ8B4A 042525609X {'Format:': ' Paperback'} Wonder Woman This book is a stand alone read as some have m... Letting go of demons...to embrace deserved love. 1381276800 NaN <NA>
4 5 False 08 15, 2015 A2XS90TMQ26YYH 0399146695 {'Format:': ' Kindle Edition'} J. Williams Great story! Highly recommended! Five Stars 1439596800 NaN <NA>

A function that get those 4 information from the original dataset.

First we get the data of nltk to get our stop words.

In [ ]:
import nltk
# download the basic list of data and models
nltk.download('popular')

# download "book" collection of datasets from NLTK website
nltk.download("book")
[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nltk_data]    |   Package shakespeare is already up-to-date!
[nltk_data]    | Downloading package stopwords to /root/nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package treebank to /root/nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package omw to /root/nltk_data...
[nltk_data]    |   Package omw is already up-to-date!
[nltk_data]    | Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]    |   Package omw-1.4 is already up-to-date!
[nltk_data]    | Downloading package wordnet to /root/nltk_data...
[nltk_data]    |   Package wordnet is already up-to-date!
[nltk_data]    | Downloading package wordnet2021 to /root/nltk_data...
[nltk_data]    |   Package wordnet2021 is already up-to-date!
[nltk_data]    | Downloading package wordnet31 to /root/nltk_data...
[nltk_data]    |   Package wordnet31 is already up-to-date!
[nltk_data]    | Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]    |   Package wordnet_ic is already up-to-date!
[nltk_data]    | Downloading package words to /root/nltk_data...
[nltk_data]    |   Package words is already up-to-date!
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package maxent_ne_chunker is already up-to-date!
[nltk_data]    | Downloading package punkt to /root/nltk_data...
[nltk_data]    |   Package punkt is already up-to-date!
[nltk_data]    | Downloading package snowball_data to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package snowball_data is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | 
[nltk_data]  Done downloading collection popular
[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package dependency_treebank is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package ieer to /root/nltk_data...
[nltk_data]    |   Package ieer is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package nps_chat to /root/nltk_data...
[nltk_data]    |   Package nps_chat is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Downloading package ppattach to /root/nltk_data...
[nltk_data]    |   Package ppattach is already up-to-date!
[nltk_data]    | Downloading package reuters to /root/nltk_data...
[nltk_data]    |   Package reuters is already up-to-date!
[nltk_data]    | Downloading package senseval to /root/nltk_data...
[nltk_data]    |   Package senseval is already up-to-date!
[nltk_data]    | Downloading package state_union to /root/nltk_data...
[nltk_data]    |   Package state_union is already up-to-date!
[nltk_data]    | Downloading package stopwords to /root/nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package swadesh to /root/nltk_data...
[nltk_data]    |   Package swadesh is already up-to-date!
[nltk_data]    | Downloading package timit to /root/nltk_data...
[nltk_data]    |   Package timit is already up-to-date!
[nltk_data]    | Downloading package treebank to /root/nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package toolbox to /root/nltk_data...
[nltk_data]    |   Package toolbox is already up-to-date!
[nltk_data]    | Downloading package udhr to /root/nltk_data...
[nltk_data]    |   Package udhr is already up-to-date!
[nltk_data]    | Downloading package udhr2 to /root/nltk_data...
[nltk_data]    |   Package udhr2 is already up-to-date!
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package unicode_samples is already up-to-date!
[nltk_data]    | Downloading package webtext to /root/nltk_data...
[nltk_data]    |   Package webtext is already up-to-date!
[nltk_data]    | Downloading package wordnet to /root/nltk_data...
[nltk_data]    |   Package wordnet is already up-to-date!
[nltk_data]    | Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]    |   Package wordnet_ic is already up-to-date!
[nltk_data]    | Downloading package words to /root/nltk_data...
[nltk_data]    |   Package words is already up-to-date!
[nltk_data]    | Downloading package maxent_treebank_pos_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package maxent_treebank_pos_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package maxent_ne_chunker is already up-to-date!
[nltk_data]    | Downloading package universal_tagset to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package universal_tagset is already up-to-date!
[nltk_data]    | Downloading package punkt to /root/nltk_data...
[nltk_data]    |   Package punkt is already up-to-date!
[nltk_data]    | Downloading package book_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package book_grammars is already up-to-date!
[nltk_data]    | Downloading package city_database to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package city_database is already up-to-date!
[nltk_data]    | Downloading package tagsets to /root/nltk_data...
[nltk_data]    |   Package tagsets is already up-to-date!
[nltk_data]    | Downloading package panlex_swadesh to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package panlex_swadesh is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | 
[nltk_data]  Done downloading collection book
Out[ ]:
True
In [ ]:
from nltk.book import *
from nltk.corpus import stopwords    
stop_words = set(stopwords.words('english'))
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

A function to get the reports.

This function will show:

  1. Text length (i.e., the number of characters).
  2. The number of words.
  3. Presence of non-alphanumeric characters
  4. Stop words accounts
In [ ]:
#--------------text_item_properties---------------#
'''We want to save all the results to a new dataframe'''
def text_item_properties(data):
  result = pd.DataFrame()
  data = data.copy()
  # we just fill na with 0 here. Without doing so, there will be an error
  data = data.fillna('0')
  for i in range(len(data.columns)):
    # get character length
    result[str(str(data.columns[i])) + 'Text_length'] = data[str(data.columns[i])].str.len() 
    # get number of words
    result[str(str(data.columns[i])) + 'num_of_words'] = data[str(data.columns[i])].str.split().str.len()
    # get non alphanumeric number
    result[str(str(data.columns[i])) + 'presence_non_alphanumeric'] = data[str(data.columns[i])].str.replace('[a-zA-Z0-9 ]', '').str.len()
    # get stop words account
    result[str(str(data.columns[i])) + 'stop_words_count'] = data[str(data.columns[i])].str.split().apply(lambda x: len(set(x) & stop_words)) 
  return result 

We will follow the importance of those text items columns.

First we take a look at review ID.

1. Reviewer ID

Use our function we defined before on single column, ReviewID

In [ ]:
reports_reviewerID = text_item_properties( raw_data.loc[:, ['reviewerID']]) ;
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()

Save the results into pkls

In [ ]:
# dump it into pkls
joblib.dump(reports_reviewerID, 'reports_reviewerID.pkl')
Out[ ]:
['reports_reviewerID.pkl']

Our reports head:

In [ ]:
reports_reviewerID.head()
Out[ ]:
reviewerID_Text_length reviewerID_num_of_words reviewerID_presence_non_alphanumeric reviewerID_stop_words_count
0 14 1 0 0
1 14 1 0 0
2 14 1 0 0
3 14 1 0 0
4 14 1 0 0

Now, let's print the continous report for those 4 properties the ReviewID feature has.

In [ ]:
build_continuous_features_report(reports_reviewerID)
Out[ ]:
Count Miss % Card. Min 1st Qrt. Mean Median 3rd Qrt Max Std. Dev.
reviewerID_Text_length 1000000 0.0 7 10 13.0 13.739826 14.0 14.0 20 0.463057
reviewerID_num_of_words 1000000 0.0 1 1 1.0 1.000000 1.0 1.0 1 0.000000
reviewerID_presence_non_alphanumeric 1000000 0.0 1 0 0.0 0.000000 0.0 0.0 0 0.000000
reviewerID_stop_words_count 1000000 0.0 1 0 0.0 0.000000 0.0 0.0 0 0.000000

I already generate the plot function in As1 and As2.

Hence, we don't write the function anymore, I put it in the Utility Function block.

Now,we just use it.

1. Historgram Plot
In [ ]:
ax = mulitple_function_plots(data=reports_reviewerID,kde_type = False, plot_type="histogram",data_type="number", fig_size=(10,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewerID_Text_length', 'reviewerID_num_of_words',
       'reviewerID_presence_non_alphanumeric', 'reviewerID_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewerID_Text_length , used 0 millseconds
2 . Finish Rendering : reviewerID_num_of_words , used 0 millseconds
3 . Finish Rendering : reviewerID_presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : reviewerID_stop_words_count , used 0 millseconds
  1. The cardinality of reviewID's number of Charaters are varied.
  2. Only 1 word in every instance.
  3. here is no non-alphanumeric instances.
  4. No stop words.
2. Boxplot
In [ ]:
ax = mulitple_function_plots(data=reports_reviewerID,plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewerID_Text_length', 'reviewerID_num_of_words',
       'reviewerID_presence_non_alphanumeric', 'reviewerID_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewerID_Text_length , used 0 millseconds
2 . Finish Rendering : reviewerID_num_of_words , used 1 millseconds
3 . Finish Rendering : reviewerID_presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : reviewerID_stop_words_count , used 2 millseconds

We can see we actually have outliers in Text length.

The reviewID is varied. That could be a problem. But if we don't need this feature, then it's fine.

2. summary

Now, we take a look at summary.

In [ ]:
reports_summary = text_item_properties(raw_data.loc[:, ['summary']]) 
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
In [ ]:
# dump it into pkls
joblib.dump(reports_summary, 'reports_summary.pkl')
Out[ ]:
['reports_summary.pkl']
In [ ]:
build_continuous_features_report(reports_summary)
Out[ ]:
Count Miss % Card. Min 1st Qrt. Mean Median 3rd Qrt Max Std. Dev.
summary_Text_length 1000000 0.0 214 1 10.0 25.559311 19.0 34.0 799 19.256370
summary_num_of_words 1000000 0.0 58 0 2.0 4.562731 3.0 6.0 149 3.537520
summary_presence_non_alphanumeric 1000000 0.0 79 0 1.0 4.398029 3.0 6.0 164 4.381444
summary_stop_words_count 1000000 0.0 23 0 0.0 1.044313 0.0 2.0 30 1.576715

Standard deviation of summary text is a little bit high. It's more than 19.26. People may write a very varied length of summarys.

1. Histogram Plot
In [ ]:
ax = mulitple_function_plots(data=reports_summary,kde_type = False, plot_type="histogram",data_type="number", fig_size=(10,7))
Those features will be plotted in  2  rows and  2 columns
Index(['summary_Text_length', 'summary_num_of_words',
       'summary_presence_non_alphanumeric', 'summary_stop_words_count'],
      dtype='object')
1 . Finish Rendering : summary_Text_length , used 0 millseconds
2 . Finish Rendering : summary_num_of_words , used 0 millseconds
3 . Finish Rendering : summary_presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : summary_stop_words_count , used 0 millseconds

They are Poisson distribution of all features.

The distribution is fair enough.

2. Boxplot
In [ ]:
ax = mulitple_function_plots(data=reports_summary,plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['summary_Text_length', 'summary_num_of_words',
       'summary_presence_non_alphanumeric', 'summary_stop_words_count'],
      dtype='object')
1 . Finish Rendering : summary_Text_length , used 0 millseconds
2 . Finish Rendering : summary_num_of_words , used 1 millseconds
3 . Finish Rendering : summary_presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : summary_stop_words_count , used 3 millseconds

There are a lot of outliers in those 4 features.

People do write their summary in varied.

But we would say that most of people's summary only has a few words.

In summary, most of them are

  1. Text_length less than 100
  2. Num of words less than 10
  3. Non-alphanumeric less than 15
  4. Stop words count less than 5

3. reviewText

Review Text will be our main feature. It has a lot of words in it, and we will based on this feature to predict our overall scores.

Let's take a look of its report.

In [ ]:
reports_reviewText = text_item_properties( raw_data.loc[:, ['reviewText']]);
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
In [ ]:
# dump it into pkls
joblib.dump(reports_reviewText, 'reports_reviewText.pkl')
Out[ ]:
['reports_reviewText.pkl']
In [ ]:
build_continuous_features_report(reports_reviewText)
Out[ ]:
Count Miss % Card. Min 1st Qrt. Mean Median 3rd Qrt Max Std. Dev.
reviewText_Text_length 1000000 0.0 7981 1 112.0 563.448763 221.0 626.0 32675 906.886441
reviewText_num_of_words 1000000 0.0 1980 0 21.0 99.983910 41.0 113.0 5853 156.915644
reviewText_presence_non_alphanumeric 1000000 0.0 2294 0 24.0 117.892685 47.0 132.0 7556 187.985523
reviewText_stop_words_count 1000000 0.0 117 0 6.0 18.044168 12.0 26.0 121 16.694456

Standard deviation is very high in reviewText, we need to investigate this further more.

1. Histogram Plot
In [ ]:
ax = mulitple_function_plots(data=reports_reviewText,kde_type= False, plot_type="histogram",data_type="number", fig_size=(10,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewText_Text_length', 'reviewText_num_of_words',
       'reviewText_presence_non_alphanumeric', 'reviewText_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewText_Text_length , used 0 millseconds
2 . Finish Rendering : reviewText_num_of_words , used 0 millseconds
3 . Finish Rendering : reviewText_presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : reviewText_stop_words_count , used 1 millseconds

We have a lots of text length near to 0 and number of words near to 0. This is not a good sign. We will investigate those further.

2. Boxplot
In [ ]:
ax = mulitple_function_plots(data=reports_reviewText, kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewText_Text_length', 'reviewText_num_of_words',
       'reviewText_presence_non_alphanumeric', 'reviewText_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewText_Text_length , used 1 millseconds
2 . Finish Rendering : reviewText_num_of_words , used 2 millseconds
3 . Finish Rendering : reviewText_presence_non_alphanumeric , used 4 millseconds
4 . Finish Rendering : reviewText_stop_words_count , used 5 millseconds

There is an instance that has 5800 number of words. That is a huge number of words.

And there are huge scale of outliers in the first three features.

We must investigate those in the next chapter.

4. style

In [ ]:
reports_style = text_item_properties( raw_data.loc[:, ['style']]) 
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
In [ ]:
# dump it into pkls
joblib.dump(reports_style, 'reports_style.pkl')
Out[ ]:
['reports_style.pkl']
In [ ]:
build_continuous_features_report(reports_style)
Out[ ]:
Count Miss % Card. Min 1st Qrt. Mean Median 3rd Qrt Max Std. Dev.
style_Text_length 1000000 0.00 22 1 25.00 28.49 30.00 30.00 47 4.19
style_num_of_words 1000000 0.00 5 1 3.00 3.69 4.00 4.00 6 0.67
style_presence_non_alphanumeric 1000000 0.00 7 0 10.00 10.65 11.00 11.00 15 1.02
style_stop_words_count 1000000 0.00 2 0 0.00 0.00 0.00 0.00 1 0.02

Our report shows style feature has relatively similar properties among all 4 perporties we list here.

1. Historgram Plot
In [ ]:
ax = mulitple_function_plots(data=reports_style,kde_type = False, plot_type="histogram",data_type="number", fig_size=(10,7))
Those features will be plotted in  2  rows and  2 columns
Index(['style_Text_length', 'style_num_of_words',
       'style_presence_non_alphanumeric', 'style_stop_words_count'],
      dtype='object')
1 . Finish Rendering : style_Text_length , used 0 millseconds
2 . Finish Rendering : style_num_of_words , used 0 millseconds
3 . Finish Rendering : style_presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : style_stop_words_count , used 0 millseconds

We can see very high non - alphanumeric characters. We need to delete those. And the num of words are varied.

And we have no stop words in this feature.

2. Boxplot
In [ ]:
ax = mulitple_function_plots(data=reports_summary,plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['summary_Text_length', 'summary_num_of_words',
       'summary_presence_non_alphanumeric', 'summary_stop_words_count'],
      dtype='object')
1 . Finish Rendering : summary_Text_length , used 0 millseconds
2 . Finish Rendering : summary_num_of_words , used 1 millseconds
3 . Finish Rendering : summary_presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : summary_stop_words_count , used 3 millseconds

We can see that there are a lot of non alphanumeric outliers here.

And a few outliers of number of words.

We will investigate it in the next chapter.

5. Reviewer Name

In [ ]:
reports_reviewerName = text_item_properties( raw_data.loc[:, ['reviewerName']]) 
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
In [ ]:
# dump it into pkls
joblib.dump(reports_reviewerName, 'reports_reviewerName.pkl')
Out[ ]:
['reports_reviewerName.pkl']
In [ ]:
build_continuous_features_report(reports_reviewerName)
Out[ ]:
Count Miss % Card. Min 1st Qrt. Mean Median 3rd Qrt Max Std. Dev.
reviewerName_Text_length 1000000 0.00 125 1 7.00 11.15 11.00 15.00 326 5.46
reviewerName_num_of_words 1000000 0.00 27 0 1.00 1.85 2.00 2.00 27 0.84
reviewerName_presence_non_alphanumeric 1000000 0.00 44 0 0.00 1.13 1.00 2.00 109 1.35
reviewerName_stop_words_count 1000000 0.00 11 0 0.00 0.02 0.00 0.00 10 0.18

Similar type of features as same as style.

There is no much variation here.

1. Historgram Plot
In [ ]:
ax = mulitple_function_plots(data=reports_reviewerID,kde_type = False, plot_type="histogram",data_type="number", fig_size=(10,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewerID_Text_length', 'reviewerID_num_of_words',
       'reviewerID_presence_non_alphanumeric', 'reviewerID_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewerID_Text_length , used 0 millseconds
2 . Finish Rendering : reviewerID_num_of_words , used 0 millseconds
3 . Finish Rendering : reviewerID_presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : reviewerID_stop_words_count , used 0 millseconds

No non alphanumerica characters.

No stop words.

All instances has one word.

We may drop this feature. Since not too much helping for distinguishing our scores.

2. Boxplot
In [ ]:
ax = mulitple_function_plots(data=reports_reviewerID,plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewerID_Text_length', 'reviewerID_num_of_words',
       'reviewerID_presence_non_alphanumeric', 'reviewerID_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewerID_Text_length , used 1 millseconds
2 . Finish Rendering : reviewerID_num_of_words , used 1 millseconds
3 . Finish Rendering : reviewerID_presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : reviewerID_stop_words_count , used 3 millseconds

Reviewer name is one of the not very important feature.

It doesn't mean too many things. We will drop it in the future. So no further investigation here.

6. asin

In [ ]:
reports_asin = text_item_properties(raw_data.loc[:, ['asin']]) 
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
In [ ]:
# dump it into pkls
joblib.dump(reports_asin, 'reports_asin.pkl')
Out[ ]:
['reports_asin.pkl']
In [ ]:
build_continuous_features_report(reports_asin)
Out[ ]:
Count Miss % Card. Min 1st Qrt. Mean Median 3rd Qrt Max Std. Dev.
asin_Text_length 1000000 0.00 1 10 10.00 10.00 10.00 10.00 10 0.00
asin_num_of_words 1000000 0.00 1 1 1.00 1.00 1.00 1.00 1 0.00
asin_presence_non_alphanumeric 1000000 0.00 1 0 0.00 0.00 0.00 0.00 0 0.00
asin_stop_words_count 1000000 0.00 1 0 0.00 0.00 0.00 0.00 0 0.00

No Std. Dev at all. Since asin are format number for representing books.

1. Historgram Plot
In [ ]:
ax = mulitple_function_plots(data=reports_reviewerID,kde_type = False, plot_type="histogram",data_type="number", fig_size=(10,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewerID_Text_length', 'reviewerID_num_of_words',
       'reviewerID_presence_non_alphanumeric', 'reviewerID_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewerID_Text_length , used 0 millseconds
2 . Finish Rendering : reviewerID_num_of_words , used 0 millseconds
3 . Finish Rendering : reviewerID_presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : reviewerID_stop_words_count , used 0 millseconds

Same as reviewer Name.

2. Boxplot
In [ ]:
ax = mulitple_function_plots(data=reports_reviewerID,plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewerID_Text_length', 'reviewerID_num_of_words',
       'reviewerID_presence_non_alphanumeric', 'reviewerID_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewerID_Text_length , used 0 millseconds
2 . Finish Rendering : reviewerID_num_of_words , used 1 millseconds
3 . Finish Rendering : reviewerID_presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : reviewerID_stop_words_count , used 2 millseconds

ASIN is actually a number for identifying the book names.

Well, it's good to get the book name. But most of API to do so cost many.

And way beyond the purpose of our course.

Hence, we don't take further investigation on it.

1.2. Identify data quality issues and build the data quality plan.

First we take another look at the most important feature, review text and summary.

1.2.1 Identify Major Data quality issues

1. Review Text

In [ ]:
ax = mulitple_function_plots(data=reports_reviewText,kde_type= False, plot_type="histogram",data_type="number", fig_size=(10,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewText_Text_length', 'reviewText_num_of_words',
       'reviewText_presence_non_alphanumeric', 'reviewText_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewText_Text_length , used 0 millseconds
2 . Finish Rendering : reviewText_num_of_words , used 0 millseconds
3 . Finish Rendering : reviewText_presence_non_alphanumeric , used 1 millseconds
4 . Finish Rendering : reviewText_stop_words_count , used 1 millseconds
In [ ]:
ax = mulitple_function_plots(data=reports_reviewText, kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewText_Text_length', 'reviewText_num_of_words',
       'reviewText_presence_non_alphanumeric', 'reviewText_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewText_Text_length , used 2 millseconds
2 . Finish Rendering : reviewText_num_of_words , used 5 millseconds
3 . Finish Rendering : reviewText_presence_non_alphanumeric , used 6 millseconds
4 . Finish Rendering : reviewText_stop_words_count , used 8 millseconds

We can see there are a lot of outliers in all our 4 text items' properties.

That is not good.

We need to investigate further about this outliers.

1. Outliers investigation

First we take a look at the instances with the most number of words in review Text.

In [ ]:
reports_reviewText[reports_reviewText['reviewText_num_of_words'] > 5800]
Out[ ]:
reviewText_Text_length reviewText_num_of_words reviewText_presence_non_alphanumeric reviewText_stop_words_count
470862 32675 5853 7556 108

We get the item by use .index

In [ ]:
raw_data.loc[reports_reviewText[reports_reviewText['reviewText_num_of_words'] > 5800].index].T
Out[ ]:
470862
overall 2
verified False
reviewTime 07 27, 2012
reviewerID A3NZBAGW4AKU5E
asin 0385474466
style {'Format:': ' Paperback'}
reviewerName R. Cannata
reviewText Posner attempts a thorough refutation of all e...
summary Definitive anti-conspiracy work is informative...
unixReviewTime 1343347200
vote 15.0
image <NA>

We save it into a new data structure and see what's inside.

In [ ]:
outlier_example = raw_data.loc[reports_reviewText[reports_reviewText['reviewText_num_of_words'] > 5800].head(1).index]
In [ ]:
max_print_out(True)
outlier_example.reviewText.values
Out[ ]:
<StringArray>
['Posner attempts a thorough refutation of all explanations of the JFK assassination that deviate in any way from the official (Warren Commission of 1964) one: Oswald was the only person involved in the assassination, and Ruby was the only one involved in killing Oswald.  The conspiracy theorists will also be called "the critics" below.  Some of the conspiracy people\'s (critics\') views are supported by the report of the U.S. House Select Committee on Assassinations (1979), a Congressional study.\n\nBelow are just a few notes of random things that interested me from Posner.  These notes do NOT constitute a summary of the book, but just some impressions.\n\nIn general, Posner tends to be needlessly wordy at points, but is basically a very good writer, discipled thinker and well-organized.  He is moderate in tone, though he tends to concede nothing and is clever in obscuring facts that are inconvenient to his main thesis.\n\nIf Oswald was a committed Marxist by 9th grade and increasingly anti-American, why did he join the Marines at age 17?  Posner -- to escape his mother (as his older brother had). Posner is convincing.  p. 19\nWas Oswald a highly skilled marksman (as the official explanation necessitates) or was he not (as the critics contend)?  pp. 20ff.  Posner shows Oswald took basic Marine rifle training.  In 1956 his score was 212, which marked him as above average.  He could hit a 10-inch, non-moving bulls eye at 200 yards with a M-1 eight out of ten times.  But three years later (closer to the assassination) he scored only a 191 (still a hair above average).  JFK was further away, moving, Oswald\'s assassination rifle was far inferior to an M-1, and he had way less time to aim and pull the bolt action.\nPosner claims Oswald was abused in the Marines and was treated by all the others as a weakling (pp. 21, 25, etc.)  But this seemed weird when contrasted with his tough upbringing, his street fighting experience, and all the ink spilled by Posner a few pages earlier about how Oswald was always the bully, the toughest kid in school.\nPosner portrays Oswald as a total loser, utter misfit, unpopular, mentally imbalanced, on the edge of being court marshalled all the time, yet he is marked for promotion to corporal within just a year of service, at age 18, which seems very fast. p. 23.  (O.\'s promotion was never enacted because he got in a fist-fight with an officer just before it was ratified).\nWhile I don\'t believe (as some critics allege) that O. was being paid by some outside group (FBI, etc.) while in marines, I find it odd how he spent so much money.  Posner dismisses this in a sentence, but the numbers don\'t add up well.  O.\'s salary was just $85 a month.  Yet by the time he left the Marines in two years he\'d saved up $1500 cash (75% of his salary).  In two years he\'d spent just $500 ($250 a year, or $20 a month)?  This included the $55 fine for his fist-fight.  So just $445 left for 24 months?Yet Posner relates all the trips to bars, traveling 100 miles at a time etc. that he did.  It just seems weird to me, but doesn\'t fit any theory for me.\nPosner contends that O. was a pathological liar.  That may have been true.  But Posner\'s use of evidence here is often very similar to that of the critics he blasts.  For example, Posner\'s claim that O. told people he applied to the Univ. of Turku in Finland but O. never did (p. 32n.).  Just because Posner (may have) asked U of Turku about O.\'s application doesn\'t mean they saved all of their rejected applications for 35 years.\nThe story of KGB Lt. Col. turned defector to U.S., Yuriy Nosenko, in June 1962 -- shocking and sad!  (pp. 34ff.)  Was he a KGB plant or bona fide defector?  According to Posner he gave the CIA great info that led to lots of KGB agents, but the paranoia of the CIA thought he was a double-agent and arrested, imprisoned and tortured him for years.  By 1969 the US admitted its mistake (after realizing he gave them 9 major cases of Soviet spies, not just disposable ones either), and put him on the payroll.  Today the CIA and FBI, Posner reports, regard him as a bona fide defector.\nBut here Posner seems to be stretching his case a little: Nosenko was in charge of American defectors (all of them?  Even Posner shows that Nosenko lost touch with O.).  When O. defected to Russia (for two years) Posner claims the KGB never even interviewed him (pp. 47-48).  That seems extremely unlikely, given the massive spy and observation network they had in Russia at the time (as Posner himself shows in that chapter).  Further, O. had worked on a US base in Japan from which U2 spy planes were launched.  Wouldn\'t the KGB want to ask him about that?  No, says Nosenko.  Why?  "We already knew everything about the U2 (p. 48)"  (Really?!?) "And he wasn\'t a mechanic on them or anything" (True, but how would the KGB have known that back then?  Esp. if they never interviewed him!)  And later Posner himself reports how when O. was sent to Minsk (for MOST of the two years he was in Russia) Nosenko lost track of him (p. 53).  And yet Posner himself tells us how the KGB watched O. more than they needed to, and had informers befriend him (p. 58).  The reason it is important to Posner that O. never talked to the KGB while an American living in Russia two years and while the KGB was watching him constantly -- Posner wants to show that O. was not on the Soviet payroll for the assassination.  I am sure O. was not.  But Posner has not come close to proven that O. never met with the KGB, and he doesn\'t need to anyway.  I think the KGB played no role in the assassination.\nPosner says the Soviets were not going to allow O. to stay in Russia when he traveled there on vacation in 1959 and then announced he wanted to defect.  They knew he was a loser and they didn\'t want him.  In fact, they gave him two hours to pack and leave.  BUT then he slit his wrists in angst over their refusal.  They put him in a loony bin and the psychiatrists found him "insane."  SO THEN they changed their minds and let him stay (p. 50).  That just seems SO IMPROBABLE.  It doesn\'t change much whether it is true or not, but it is an important plank in Posner\'s case that O. was just batty for years.\nPosner says the Soviets had no interest at all in O. yet he reports how they gave him $500 when he defected, let him stay, etc. (p. 56).  Again, I agree with Posner that the KGB played no role in the assassination, it\'s just that Posner feels the need to somehow contend the Soviets didn\'t speak to O. during his defection.\nO. married Russian girl Marina quickly in 1961 (after knowing her just a  few weeks) while on the rebound from a breakup with another Russian girl (pp. 64ff.).  He is 21, she is younger.\nDisillusioned by the Soviets (but not by Marxism), Posner says O. returned to US in 1961, seeing the U.S. as the "lesser of two evils." (p. 75).\nWhen he got back, the FBI and CIA met with O. in TX several times.  For years the CIA denied that they had met with him, knowing this was not true.  And it was NOT common to debrief returning defectors.  Of the 22 American defectors who returned to the U.S. 1958-63 only four (18%) met with the CIA or FBI.  Further, the record of the CIA meeting with O. is noted, but the notes and files have been `lost.\'  In fact, the agent who did one of the interviews in 1962, Andy Anderson, has not been located.  It is very understandable, but all this raises suspicion of the critics.  Posner relegates this to a footnote (p. 78n) instead of the body of his work.\nO. was largely friendless except for one aristocratic, politically liberal, older, educated emigre -- George de Mohrenschildt (pp. 84ff.)\nO. bought the rifle he used in the assassination (and he WAS definitely at least A shooter) for just $21.45 in 1961.  The 6.5mm Mannlicher-Carcano seems a poor choice to assassinate with.  Italians (who made it) joke that its a "humanitarian" gun, because it can\'t hit anyone (p. 103).  But Posner claims it was sufficient for the job.  And he goes further and quotes one "expert" stating that as long as you have a scope you need no training to hit your target.  This is the same kind of stupid claim the critic often make, and Posner rightly dismisses them for it.\nPosner reports that O. tried to assassinate a right-wing retired Gen. Walker in his home in Dallas.  O. totally missed on a much easier shot than the ones killing JFK. (p. 114).  Gen. Walker was sitting still, the shot was much closer, and O. had all the time in the world to set up (being in the dark, with no guards or crowds or moving objects).  Yet he totally missed.  This doesn\'t prove he could not fatally hit JFK, just makes it even less likely.  But Posner spins this incident to prove O. is a nut. We know that already.\nAgain, Posner relegates a very interesting part of this mystery to a footnote, since it interrupts the smooth fabric of his thesis.  O.\'s best friend in TX, the rich, educated, socialite, liberal, older aristocrat de Mohrenschildt: in 1977 he told Edward Jay Epstein (who is a balanced, non-wacko moderate critic of the official JFK story, with a masters from Cornell) that the CIA in 1962 asked him to keep tabs on O.  A few hours after telling Epstein that, Mohrenschildt killed himself with a shotgun blast to the head.  Posner dismisses Mohrenschildt\'s claim by saying M. was insane (this is what he says of almost EVERY person who claims something inconvenient to his thesis).  Maybe M. was crazy, I don\'t know.  But if his claim is true, taken together with the CIA meetings with O. that they denied forever, lost the notes from, but even Posner admits happened, this makes things suspicious.  Still I\'m unconvinced CIA played any role.\nPosner is generally very thorough in his use of details.  But then I notice lots of mistakes just among the small number of things I happen to know about.  This makes me wonder how careful and accurate he really is.  For example, p. 124 he talks about O. renting an apartment at a "two-story" house at 4907 Magazine St. (p. 124). I happen to walk by there every day and know that the house is one-story.  He misspelled Metairie (p. 142).  And calls Rampart St., "Ramparts" (p. 150). And I am 100% certain that there was never a Winn-Dixie on Magazine St. in 1963 (p. 171).  And he claims a birth record can\'t be found on Jack Ruby in 1911 because there were no official Chicago birth certificates til 1915 (p. 350n.)  But I have seen my great-grandmother\'s 1892 Chicago birth certificate and many others, pre-1915.  I don\'t think Posner is being dishonest in any of these, just sloppy and/or lazy about some details, which make me wonder about other details he uses with similar confidence.\nPosner has a habit of making claims that he cannot prove, but doing it in a clever way that a reader lacking diligence or skepticism can miss.  For example, he talks about the testimony of a Mr. Alba whose garage, Posner admitted, serviced cars for the  Secret Service and FBI.  Alba told the House Select Committee in 1979 that he saw O. approach an FBI car at his garage in 1963 and receive a fat envelop from its passenger before it drove off.  Posner is skeptical of this event (as anyone probably should be).  But he dismisses it by saying that "no FBI agents checked a car out of his garage during all of 1963"  (p. 131).  May be true, but how can Posner know that for sure?  The FBI or the garage honestly have complete records on that fact, that they maintained 15 years later?  No chance.\nO. is depicted by Posner as utterly stupid, foolish, only superficially knowledgable about politics, unlikeable, inarticulate, and unconvincing.  Yet he discusses how a Jesuit priest had O. travel to Mobile, Alabama from New Orleans to give a half hour talk on Russia and Communism to a group of students (p. 135).\nSome of O.\'s fliers from his pro-commie Fair Play for Cuba demonstrations in NOLA have survived with stamps and return addresses of 544 Camp St. on them.  This is the address of a small building that was office to Guy Bannister at the time.  Guy was, by Posner\'s admission, a "highly decorated ex-FBI agent who maintained a relationship with Naval Intelligence" and was doing work as a p.i. for Carols Marcello (New Orleans mafia\'s top godfather) and his attorney.  Bannister is placed as an associate of O. by several witnesses that Posner tries to discount.  But he also admits David Ferrie (who worked closely with Bannister and who had known O. for 10 years) was at 544 Camp alot, also working for Marcello.  Posner\'s explanation for how the 544 Camp address got on the fliers is extremely weak, in my estimation.  (p. 136).\nAnother tactic of Posner that is subtle and effective, but maybe not the most helpful: He wants to discredit someone who makes an eyewitness claim of something Posner finds inconvenient to his thesis.  If a critic of that witness has something negative to say about the witness that is untrue, he quotes that critic.  For example, Jack Martin worked with Bannister and said he saw Bannister often with O. etc.  He quotes another person (Badeaux) as saying Martin "drank, took pills, and had a criminal record."  (p. 138)  Why didn\'t Posner say it himself?  Because maybe its not fully true.  I bet Jack did drink and take pills (many do).  But did he have a serious criminal record?  If he did Posner would have told us more about it, or at least footnoted it.  And even if he did, it doesn\'t mean he is lying about O. meeting with Bannister.\nDid David Ferrie (who frequented 544 Camp, where his friend Guy Bannister worked and where O.\'s fliers have as their return address) know O?  They were in the Civil Air Patrol together in the 1955, when O. was a 15 or 16 (p. 142).  Posner weakly tries to dispute this, but admits it is possible.  I\'d say likely, based on what everyone else says and the many witnesses.  And then in 1963 there is the Bannister and 544 Camp connection.  And Ferrie is definitely strongly connected to Carols Marcello (New Orleans\' mafia kingpin) -- in fact, Posner fails to tell us by (House Chief Counsel) Blakely, the photos etc. all show Ferrie sitting next to Marcello in the courtroom when M. was acquitted of (unrelated) charges Nov. 22, 1963.\nSeveral witnesses claim O., Ferrie and maybe Clay Shaw showed up in Clinton, Louisiana in Sept. 1963.  (pp. 143ff.)  That is a weird story, and its unknown what connection it would have to the various conspiracy scenarios. Posner, of course, goes to great lengths to discredit the story.  And he is convincing.  It just seems odd that so many people (even though he picks apart each one as either a nut, or one who came to the story late, or who was intimidated or manipulated by NOLA DA Jim Garrison\'s people for the 1967 trial, or as contradicting details of what others say they saw) say they saw something, in such a small, sleepy town.  It seems likely that they saw SOMETHING.  But who knows?\nAGAIN that logical error: Years later a hospital in Clinton has no record of O.\'s alleged unsuccessful application for a job there (who keeps old resumes of everyone applying for menial, minimum wage jobs??), therefore he must have never really applied (as a witness said he did).  (p. 146).\nRegarding the Clinton, LA thing: Posner notes only in passing that the House Select Committee on the JFK thing in 1979 found some of the witnesses\' stories convincing (p. 147).  He dismisses this as the "power of suggestion."  But that is pretty significant.\nAfter most of 1963 in Oswald\'s native New Orleans (where he was born and raised and where his parents\' were born and grew up), he left New Orleans Sept. 24 (two months before killing JFK) for Dallas. (p. 144).\nThis anti-Castro guy still has a copy of O\'s Marine manual with O\'s signature in it, that O. gave him in 1963 (p. 151).\nWhen O. was arrested for his pro-Cuba demonstrations in New Orleans in 1963, the one who paid his bail was Emile Bruneau, a state boxing commissioner involved in organized crime.  This is not small.  How did mafia Bruneau know O. so well?  Why was he willing to post bail for O.?  Posner, again, knows he must deal with this fact, but minimizes it by relegating it to a footnote (p. 156n.).  He only acknowledges that Bruneau "knew" Nofio Pecorn, an `associate\' of New Orleans crime boss Carlos Marcello.  (BTW -- Jack Ruby\'s phone records, Posner admits, included calls to Nofio !?!?!  p. 362).  very suspicious, especially when added to the other mafia contacts with O.\nLike the fact that a cultured aristocrat was O\'s best friend, and like the fact that he\'s only got a 9th grade education but is being invited to give 30 minute talks to seminary students out of state, and that he\'s promoted to corporal at age 19 after just a year in the marines, the relentless Posner thesis that O. was a complete babbling idiot seems undermined by his own reporting of O\'s appearances on New Orleans radio.  Posner portrays Mr. Stuckey as a sharp journalist with a high-brow political discussion program.  Stuckey interviews O. on Cuba for 37 minutes and finds him an "articulate" spokesman (p. 160).\nStuckey is told O. is a commie and he goes to the FBI office the next day and reviews O\'s file personally and sees it is true.  This seems very unlikely -- the FBI trusted Stuckey?  The FBI just lets him walk in and they hand over O\'s file.  And, BTW, I thought the FBI wasn\'t that interested in O, according to Posner?  (p. 161).\nThe description that Mrs. Jesse Garner, O\'s 4907 Magazine St. landlady, gave of O\'s behavior is exactly what the current resident of 4907 recently told me!  (p. 167) That woman had grown up one block over.  Said the same things as Garner -- he dressed in a  black overcoat even in the summer, and put his trash in other people\'s cans because he was too cheap or poor to pay for his own trash pick up.  ha!  Or, I wonder, did that neighbor just read Mrs. Garner\'s statements in this book (or another one)?\nA woman named Odio was an organizer of anti-Castro people in TX.  She claims that O. came to her apt. with two other Cubans during a time when he was supposed to be in Mexico (p. 178).  The conspiracy people believe this, but Posner does not.  He does a decent job of discrediting Odio\'s claim, but I wonder what spin he is putting on it, since he admits that the House Select Committee in 1979 found her testimony "essentially credible" in its report, and that "there is a strong probability that one of the men was Lee Harvey Oswald." (p. 176).\nPosner holds that O. tried to defect to Cuba through Mexico, or maybe go back to Russia through Cuba via Mexico.  He gives some good evidence for this (pp. 180ff.).  Picky point but in that account he uses "embassy" and "consulate" interchangably.  But I thought the two were different.  Maybe I am wrong.  (I do know that there is a French "consulate" in New Orleans, but that its not  a full-blown "embassy").  The only reason this would be significant at all is that he quote the Cuban consul in Mexico making this mistake in a direct quote.  So if I am right, Posner inaccurately rendered the quote.  Makes you wonder how sloppy was Posner in editing other quotes?\nBesides Odio\'s testimony that O. wasn\'t in Mexico trying to get to Cuba, but in TX that week, there are other problems with Posner\'s case.  The CIA sent out some teletype description of the man Posner claims is O. at the Cuban consul in Mexico (Oct. 10, 1963, a week after the visit) to the FBI and it describes someone about 35 (O. was 24), 6 feet tall (O was 5\'9") etc.  It sent a photo that wasn\'t O.  When the Cuban consul who had the conversations with this man testified before the House Select Committee in 1979 he said this man did not look like the photos of O. (p. 188).  The House Select Committee investigator wrote a 266-page report on the alleged Mexico incident and concluded it was "likely that an Oswald imposter visited the Cuban and Soviet embassies."  (p. 189).  Now that is weird!\nOr maybe O. was in Mexico and something even weirder happened.  I do NOT believe the Cubans played any role in the assassination, but this is odd: Posner tells how on Nov. 25, 1963 (just three days after the assassination), a Nicaraguan went to the American embassy in Mexico claiming he knew something about O. visiting the Mexican consul and getting $6500 from Cuba.  How did this guy know (3 days after the killing) that O. had been at the consul in Mexico in Oct.?  It wasn\'t in the news for a very long time.  The US Ambassador Thomas Mann was not convinced the story was completely false (p. 194).  Posner reports in a small footnote that LBJ himself thought Castro was behind the assassination (p. 194n)!?!?  Another witness came forward ten days after the assassination to say O. got money at the Cuban consul in Mexico.  (Again, how would he know O. was in Mexico recently at the consul?)  Maybe these people saw him at the consul, just didn\'t see the money!??!?  Or maybe they were random made up tales that happened to overlap with plausible events?!?\nWere there three shots or four?  Posner does a nice, thorough job of giving the case for three shots only, all by O.  But I am not convinced for a number of reasons.  In the Warren Commission\'s 1964 report they interviewed a trillion witnesses who heard all different things.  One critic does an analysis of these ear-wtinesses and found 52% thought the shots came from the grassy knoll and only 39% from the book depository, where O. was set up.  But Posner takes him to task for the way he does analysis.  Posner comes out with a wildly different count.  But I\'m not sure I agree even with some examples he gives to support his argument.  For example, there is a cop who immediately ran toward the grassy knoll seconds after the shootings.  The critics reasonably counts him as thinking the shots came from the grassy knoll, yet Posner does not.  How?  Well, in the (probably long-winded, rambling testimony) of the cop he said something like "but I guess the shots could have been from any direction."  Of course, that is true.  He didn\'t see the shots and only heard them.  BUT he obviously DID think he heard them from the grassy knoll or he wouldn\'t have run that way.  In fact, MANY cops immediately headed toward the grassy knoll (see also pp. 247, 268), while only ONE went toward the book depository.  The House Select Committee found that of the 178 witness statements given immediately after the shooting, 44 percent had no idea where the shots came from, 28 percent were sure the Book Depository (O.\'s nest) and 14 percent the grassy knoll (p. 235).\nThe Warren Commission was a total mess and was rushed, as Posner shows in great detail.  Yet he wants to stick with their finding that there were just three shots, two connected, and all were from O. from the book depository (thus behind the pres.)  But the House Select Committee in 1979, working with more info., in less of a rush, found "there was a 95% certainty that a fourth shot was fired from the grassy knoll....and therefore a second gunman." (p. 237).\nPosner doesn\'t say this, but obviously part of this conclusion comes from the fact that most of the cops thought a shot came from the grassy knoll, and so did many other witnesses.  Part must come from their investigation of O.\'s connections to others wanting JFK dead etc.  But Posner makes it sound like the House Select Committee became "95% certain" based solely on the testimony of their sound experts who analyzed the audio recording of the events and heard four shots.  The shots were just impulses, barely audible, and only by high tech analysis.  Posner picks this apart and makes the case for why all of the top experts were wrong.\nFascinated by the "babushka lady."  She was seen filming everything, but nobody knows who she was and her film never turned up.  (Only one other amateur film caught the event).  What insights could be gained if her film had surfaced?  Why didn\'t she come forward?  (Crazy woman did later, but with no film).  (p. 259).\nNobody actually saw O. shoot except a construction worker facing the window from 100 feet away, Howard Brennan.  His testimony is very cool (and chilling) (pp. 246ff.)\nBecause of Brennan\'s testimony to a cop on the scene, the cop ALMOST caught O.  And an a.p.b. went out at 12:45, just 15 minutes after the shooting, describing O. almost to a t: "age 30, slender, 5 foot 10, 165, with a rifle." (p. 247).  Its because of that that the cop found O.\nO. races out of the building after the shooting.  They do a roll call of all of the staff of the book depository and ONLY O. is missing (p. 271).  He\'d grabbed a bus (that his former landlady was on!), got off when it got stuck in traffic, walked awhile, caught a cab, and got back to his hood.  Cop sees him.  He shoots cop in front of lots of witnesses (pp. 272ff.).  Runs into a hardware store to avoid other cop.  Manager thinks he\'s suspicious (p. 278).  Follows him out to movie theater.  Calls cops.  They arrest him.\nPosner discusses how O. had very little time to prepare (less than a day) and rushed so much he only brought four bullets (though his clip had room for six). He had never seen a presidential motorcade either and had no idea what to expect in terms of security (p. 249).  Posner must know that all of this adds to the sense that O. had help, and/or there was a second gunman.\nThe so-called three tramps were found in a railroad car a few blocks from the grassy knoll.  (p. 271).  They seemed too well dressed for hobos.  They were taken into custody, but released.  Very odd that the cops did not take their names.  This is very weird.  For years people tried to figure out who they were.  The tallest one was supposedly Charles Harrelson, a convicted contract killer (who was also Woody Harrelson\'s dad, who admitted it later).  But in 1992 Dallas police finally found a file with the names of the three tramps (HOW could it take that long?).  Two were real hobos, and Harrelson was not involved.\nGet O. to the police station (after the movie theater arrest) and the police captain says, "We\'re looking for a Lee Harvey Oswald as a possible suspect of the president\'s assassination.  He left work at the book depository after the shooting."  COp says, "You\'re in luck.  We already have him.  He\'s the one we arrested for shooting the patrolman." (p. 280).\nPosner mentions that they matched a fingerprint from the Book Depository scene with O. fingerprints on file from his 1963 arrest over the Cuba protest in New Orleans (p. 283). Makes me wonder -- did they not take his fingerprints when they arrested him after the movie theater thing and then knowing soon he was probably JFK\'s killer?  That is bizarre!\nJFK had a dispensation from the Ro. Catholic Church to eat meat on a Friday the day of the assassination.  On what basis?  (Nothing to do with assassination, just interested me).  (p. 285).\nPosner quotes some experts as saying much of JFk\'s autopsy was hurried and botched because of political pressures (p. 302).  This explains why some of its conclusions don\'t match the "one shooter from behind, two bullet" official explanation.\nThe official autopsy could not determine the path of the bullet and were not at all sure the front of the neck was an exit wound. (p. 303).  This is huge.  But Posner explains the problem was partly caused by the work to save the president, which resulting in a trake in the presidents\' neck.\nThis is pretty big: the doctors who worked on JFK at the hospital said at the time that the wound on the front of his neck was an "entrance" wound rather than an exit.  (Which would mean a second gunman from the grassy knoll).  The neck wound was very small, BTW, while exit wounds tend to be very large (pp. 304ff.), especially when they hit stuff inside as this one did.\nThe description of the head wound (huge in back) was also evidence of that as exit not entrance wound.  Doctors initially said they saw shattered cerebellum of the cortex area.  If true, that also meant shot from front.  (p. 309).  BUT they said it not only at the time, but even months later in the Warren Commission report!  Some of the doctors much later said they were wrong about that.  But two of the five stuck with the testimony that they saw a wound in the back of the head and cerebellum tissue.  That is huge.  Posner cleverly says "only two doctors" still think this, and doesn\'t mention that the two are 40% of the attending physicians (p. 311).\nSo we have the fact that many cops and eyewitnesses head the shot from the grassy knoll, doctors at the scene remember a wound being from front, etc. What about the film showing the president jerking back in one shot?  Posner explains how it is counter-intuitive but that if you get shot from behind you may jerk backwards instead of falling forwards.  (p. 313).  I have no idea.  I\'d like to ask a doctor who knows.\nWhat about the two motorcycle cops in the REAR of the car who got splattered with blood and brain tissue?  Again, Posner takes something as seemingly important and inconvenient as that and moves it to a footnote.  His explanation is that in the new `enhanced\' version of the home movie it shows they weren\'t really behind the car but actually at the side (p. 315).  That seems impossible.  I\'d like to see it myself.\nPart of the official explanation, that Posner defends, involves one of the bullets passing all the way through JFK\'s body, including hitting some bone on the way, leaving his body they passing through Gov. Connally\'s wrist, torso and thigh (giving him near fatal wounds too) (p. 316).  What a bullet!?!?\nHow did O. get off three shots?  In between each one he had to pump the bolt action, and aim again at a moving target, six stories below him, with obstructions (like a big oak tree).  And he did it successfully 2 out of 3?  It seems a little hard to believe.  The Warren Commission report estimated he had between 4.8 and 5.6 seconds to get off the three shots.  5.6 seconds, they found was the absolute minimum time needed (if you are an olympian).  So this was `possible.\'  (p. 319).  Does that sound convincing?  Try it with your hands in front of a mirror with a stop watch.  I can\'t do it with my fingers in 5.6 second.  Posner knows this.  So he makes a change that extends his time a little bit.  He says the assumption has been that shots 1 and 3 hit JFK, but if we make it shots 2 and 3 instead, that gives him 8 seconds instead of 5.6.  (p. 319ff.)  That is still pushing it, but gives more breathing room.  I think he is right.  IF there was just O. as the only gunman than it HAD to be shots 2 and 3 and a total of 8 seconds, or its basically close to impossible.  He bolsters the argument with pretty clever analysis of the ear-witness testimony, and a plausible alternative approach to interpreting the frame-by-frame analysis of the amateur film.\nPart of this scenario involves an oak tree limb basically vaporizing one of the shots (explaining why its bullet was never found).  But I wonder how a tree limb can vaporize a bullet, while Posner also thinks that one of the shots went through JFK, hitting bone, then through Connelly and remained intact enough that we found almost the entire bullet? (p. 329).\nThey interrogate O. for a total of 12 hours, over five separate times, while he is in custody for killing the president of the United States.  And YET not only are no recordings made (Posner plausibly reports that tape recording was not done in Dallas interrogations in 1963), but there aren\'t even any notes (p. 343).  He has to see how weird that seems and why this would fuel paranoia of the conspiracy people.  WHY not?  Was it because he was telling how the mafia helped him and somebody didn\'t want that recorded?  The same cop who let Ruby get so close and kill O. while in custody?\nThe material on Jack Ruby (pp. 350ff.) is fascinating.  Against all reports, Posner argues that Ruby had no real mob ties whatsoever.  Mostly he has some Dallas locals who "knew" him saying "no way. Ruby was a big mouth that the mob would never trust."  But even from Posner\'s own material this seems not right.  Posner admits he was an "union organizer" in a corrupt, mob-tied union (p. 352).  One of his close friends is convicted of narcotics trafficking and bribing police, another of his business partners (co-owner of one Ruby\'s night clubs) has a criminal record (p. 359).  His brother admits his strip club was a mafia hangout (p. 360).  Ruby was arrested 9 times in the last 14 years in Dallas (p. 360).  He is tough-guy, violent as hell (Posner gives many examples).  Again, relegated to footnote, Posner admits Ruby\'s "good friend" Lewis McWillie did crime business in Havana (The House Select Co']
Length: 1, dtype: string

Well, by searching this instance's ASIN. We got this is a book called

Case Closed: Lee Harvey Oswald and the Assassination of JFK

It seems like that this review Text is a part from the book, instead of a real review from customers. We don't want that.

Hence, we can set a threshold to get the instances with reasonbale number of words. In case we get something like this, which is not a review at all.

Now, we reuse our find outliers function in Assignment 1 to find out how many outliers are there.

In [ ]:
find_outliers(reports_reviewText, 'reviewText_num_of_words')
IQR = 113.0 - 21.0 = 92.0
MAX = 251.0
Min is 0
Num of min outliers:  0
Num of max outliers:  64716
Num of negative outliers:  1
Num of the original data set's whole instance 1000000
Rate of purged data/total data 0.064716

Let's take a look at what the review Text like with 300 words.

In [ ]:
reports_reviewText[reports_reviewText['reviewText_num_of_words'] > 300].head(1)
Out[ ]:
reviewText_Text_length reviewText_num_of_words reviewText_presence_non_alphanumeric reviewText_stop_words_count
3 1807 318 356 44
In [ ]:
max_print_out(True)
outlier_example = raw_data.loc[reports_reviewText[reports_reviewText['reviewText_num_of_words'] > 300].head(1).index]
outlier_example.reviewText.values
Out[ ]:
<StringArray>
["This book is a stand alone read as some have mentioned in their reviews. To really understand the whole dynamic behind the characters; you should really read the other books in the series.  I wondered how Dane was going to tell Raven' s story because there was just such a strong energy bouncing off Raven. Her feelings were raw in the Brown Felt books and the Dinner Party Girls. I felt like Lauren really needed to get into Raven's Psyche and give Raven's voice as well as her emotions a chance.\nLauren did it.  She wrote Raven as the edgy bitch some of us had become curios about.  Raven had been independent and alone for the majority of  her life. Letting people in her life felt wrong and placed her in a very vulnerable position.  She had never felt LOVED and cherished until she started a relationship with her bosses sister. Erin befriended Raven immediately and it was the relationship that began preparing Raven for the real deal, Jonah Warner. Jonah Warner, Levi' s brother, could not and would not be denied. It took Jonah patiently peeling Raven' s defenses away like a blossoming onion before she finally completely submitted.\nTheir relationship finally opened up as Raven starts letting go and trusting that Jonah is definitely not a hot fling.\nRaven eventually is forced to confront her demons and luckily she doesn't have to do it alone. Answers to questions she had been struggling to know finally were resolved.  Anger she was consumed with began to dissolve, and the closure she needed concerning her family finally happened. She was able to freely fall in love.\nDane never fails to let her fans look into the lives of the other characters. Once again we get to reconnect  with the rest of the family. The Brown Family and the Dinner Party Girls all come together to take care of Raven."]
Length: 1, dtype: string

This is a valid book review. Just very long.

Let's take a look at more word's review.

In [ ]:
max_print_out(True)
outlier_example = raw_data.loc[reports_reviewText[reports_reviewText['reviewText_num_of_words'] > 500].head(1).index]
outlier_example.reviewText.values
Out[ ]:
<StringArray>
["In The Last Man on Earth by Tracy Anne Warren, Madelyn and Zack are both advertising executives at a big firmwith a lot of rivalry and misunderstandings between them. On the outside, Madelyn cant stand Zack. But on the inside?\n\nMadelyn is not looking to settle. She broke up with her long time sweetheart knowing what she felt for him was not the love she was looking for. She wants a man who turns her inside out, makes her happy and loves her. Definitely not someone like Zack. Zack doesnt have relationships, he is definitely not looking to settle down and live happily ever after. But after some New Years fireworks between Zack and Madelyn, they soon find the chemistry between them impossible to resist. Not wanting to be the subject of office gossips or mix personal with business, they keep their relationship strictly between the two of themescaping for weekends together.\n\nSoon though, in a surprise even to herself, Madelyn finds herself falling harder and harder for Zack. Zack unembellished was better than any fragrance could ever hope to be. But Zack has a lot of issues in his past he has never moved on from that continue to hinder any relationships he hashe doesnt want love and he doesnt want marriage. He does want Madelyn, though in his own waybut Madelyn is still not willing to settle, no matter how much shes come to love Zack. Between their faltering relationship, office politics and different dreams for the future, both Zack and Madelyn are going to have to make their choices and live with their decisions. As hard as it might be\n\nThe Last Man on Earth was a fun, fast paced romance that at times was highly amusing! Zack and Madelyns interactions are so perfect, you could feel the air sizzling around them. Zack likes to egg Madelyn on, There was nothing quite like watching Madelyn Graysonget completely worked up. Especially when it was over him. I definitely got that Zack was not looking to get serious with anyone, not even Madelyn, but as a reader I could see what he couldn't seehe was falling for her deeper and deeper even though he didn't think he wanted to. Zack on the surface is full of fun but serious at workbut he had a sensitive side he hid very well. At times he gave off mixed messages to Madelyn, his insecurities peeking through. Madelyn is at times an independent, smart, very successful woman, but at other times it seemed like she was doing too much to please others which I found frustrating coming from her character! She was a little harder for me to understand. It was really fun watching Madelyn and Zack get to know each other. It started off as a sexual thing, but evolved into something that could be so much more.\n\nThe Last Man on Earth is a bit of a lengthy read (323 pages) that at times felt a bit drawn out, but for the most part Zack and Madelyns love affair kept me intrigued and wanting to get back to them! All of the characters were written well into The Last Man on Earth from the main characters to the secondary characters. I really enjoyed Madelyns friend Peg and I cant wait to read the next book in this series.\n\nId recommend The Last Man on Earth to any romance reader  especially if you like secret romances with great chemistry thats not too explicit."]
Length: 1, dtype: string

When words become more and more, the review text just become more and more make no sence.

We better purge all outliers by the number of words.

Before we moving furthuer, we need to know if these kind of reviews also exit in our kaggle dataset.

In [ ]:
kaggle_data = pd.read_json('/content/drive/MyDrive/A3/sample.jsonl', lines=True)
kaggle_data = kaggle_data.convert_dtypes()
kaggle_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 12 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   overall         1000000 non-null  Int64  
 1   vote            195489 non-null   string 
 2   verified        1000000 non-null  boolean
 3   reviewTime      1000000 non-null  string 
 4   reviewerID      1000000 non-null  string 
 5   asin            1000000 non-null  string 
 6   style           982181 non-null   object 
 7   reviewerName    999966 non-null   string 
 8   reviewText      999876 non-null   string 
 9   summary         999693 non-null   string 
 10  unixReviewTime  1000000 non-null  Int64  
 11  image           2233 non-null     object 
dtypes: Int64(2), boolean(1), object(2), string(7)
memory usage: 87.7+ MB
In [ ]:
kaggle_reports_reviewText = text_item_properties( kaggle_data.loc[:, ['reviewText']]);
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
In [ ]:
joblib.dump(kaggle_reports_reviewText,'kaggle_reports_reviewText.pkl')
Out[ ]:
['kaggle_reports_reviewText.pkl']
In [ ]:
ax = mulitple_function_plots(data=kaggle_reports_reviewText, kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewText_Text_length', 'reviewText_num_of_words',
       'reviewText_presence_non_alphanumeric', 'reviewText_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewText_Text_length , used 1 millseconds
2 . Finish Rendering : reviewText_num_of_words , used 1 millseconds
3 . Finish Rendering : reviewText_presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : reviewText_stop_words_count , used 3 millseconds

Well, the kaggle data also has the kind of instances with thousands of words.

People are writing reading thoughts on Amazon reviews channel. That is quite interesting.

We will figure out how to handle these later.

We don't need the kaggle dataset anymore. Delete for free the memory.

In [ ]:
del kaggle_data

There is no way around of these long reviews. For me personally, I would not read those long reviews or reading thoughts when I want to buy a book.

Hence, we better purge all of those outliers and see what's left.

Rewrite our find_outliers function to return the purge index. Since we are operating two different datasets with the same indexing system at a time.

In [ ]:
#-------------find_outliers-----------------
def find_outliers(data_df, parameter,* , drop=False, set_threshold=False, threshold_value = 350): # deal with outliers    
    '''detect and delete outliers '''
    # same with previous find_outliers function
    Q1 = data_df[parameter].quantile(0.25)
    Q3 = data_df[parameter].quantile(0.75)
    IQR = Q3-Q1
    
    print(f"IQR = {Q3} - {Q1} = {IQR}")
    print(f"MAX = {(Q3 + 1.5 * IQR)}")
    
    if Q1 > 1.5*IQR :
        print("Min: ", (Q1 - 1.5 * IQR))
    else:
        print("Min is 0")

    cut_out_value =  (Q3 + 1.5 * IQR) # normal outliers deleted
    # override the value if we set threshold
    if set_threshold == True:
        cut_out_value = threshold_value
    
    # get min outliers' index 
    # get max outliers' index
    min_outliers_df = data_df[(data_df[parameter] < (Q1 - 1.5 * IQR))]
    max_outliers_df = data_df[(data_df[parameter] > cut_out_value)]
    # get negtive outliers' index  
    negative_outliers_df = data_df[(data_df[parameter] <= 0)]         
    print("Num of min outliers: ", len(min_outliers_df))
    print("Num of max outliers: ", len(max_outliers_df))
    print("Num of negative outliers: ", len(negative_outliers_df))
    print("Num of the original data set's whole instance", len(data_df))
    print("Rate of purged data/total data", len(max_outliers_df)/ len(data_df))

    # It's pretty hard to drop multiple indexes at the same time
    # Because after one drop action, their index are changed from then
    # We need to alter the order of aboving codes. 
    # And it's pretty unnecessary for us to do this in our assignemnt
    # Since we don't have min outliers in this dataset
    # And negative values are not outliers
    # I decided to purge negative values in transformer instead of here
    return max_outliers_df.index
In [ ]:
purging_index = find_outliers(reports_reviewText, 'reviewText_num_of_words')
IQR = 113.0 - 21.0 = 92.0
MAX = 251.0
Min is 0
Num of min outliers:  0
Num of max outliers:  108609
Num of negative outliers:  1
Num of the original data set's whole instance 1000000
Rate of purged data/total data 0.108609

Max words is 251.0. And there are 108609 outliers.

Take nearly 10% of whole 1 million set.

Let's take a look at the boxplot of when we purged the outliers.

In [ ]:
ax = mulitple_function_plots(data=reports_reviewText.drop(purging_index), kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewText_Text_length', 'reviewText_num_of_words',
       'reviewText_presence_non_alphanumeric', 'reviewText_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewText_Text_length , used 0 millseconds
2 . Finish Rendering : reviewText_num_of_words , used 1 millseconds
3 . Finish Rendering : reviewText_presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : reviewText_stop_words_count , used 2 millseconds

The number of words' feature has been taken care of. The review Text length still has very large number of outliers.

We take a look at its outliers on the purged set we just got.

In [ ]:
report_data_purged = reports_reviewText.drop(purging_index)
In [ ]:
# get the first maximum text length instance with our purged dataset.
report_data_purged[report_data_purged['reviewText_Text_length'] > 2700].head(1)
Out[ ]:
reviewText_Text_length reviewText_num_of_words reviewText_presence_non_alphanumeric reviewText_stop_words_count
156379 2797 218 652 20
In [ ]:
# print the instance from the raw_data and check if's index are the same.
max_print_out(True)
outlier_example = raw_data.loc[report_data_purged[report_data_purged['reviewText_Text_length'] > 2700].head(1).index]
outlier_example
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime vote image
156379 4 False 03 14, 2001 A1EI1093WQNLCY 0441005810 {'Format:': ' Paperback'} David Johnson H. Beam Piper's Fuzzy novels,&nbsp;<a data-hoo... How do you know if a fuzzy alien is intelligent? 984528000 8 <NA>
In [ ]:
# get the value of this instance
demo_text = outlier_example.reviewText.values
demo_text
Out[ ]:
<StringArray>
['H. Beam Piper\'s Fuzzy novels,&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Little-Fuzzy/dp/159818797X/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Little Fuzzy</a>&nbsp;(first published in 1962),&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Fuzzy-Sapiens/dp/B000EG6BA8/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Fuzzy Sapiens</a>&nbsp;(originally published as&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/The-Other-Human-Race/dp/B000H0O5CC/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">The Other Human Race</a>&nbsp;in 1964), and&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Fuzzies-and-Other-People/dp/B000ENMVTQ/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Fuzzies and Other People</a>&nbsp;(first published in 1984), are perhaps the best treatment ever of the nature of intelligence in science-fiction.  The three novels deal with the assorted legal and political challenges which occur in the aftermath of the discovery of the Fuzzies--small, cute, furry humanoids--by human settlers on the planet Zarathustra.  Part crime drama, part space opera, Piper\'s novels remain a joy to read even though many of their early-1960\'s technological and cultural accouterments are a bit outdated.\n\nInterestingly, the third novel in the Fuzzies series, published posthumously, appeared after the publication of two "authorized" sequels penned by other authors: William Tuning\'s&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Fuzzy-Bones/dp/0441261825/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Fuzzy Bones</a>&nbsp;(1981) and Ardath Mayhar\'s&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Golden-Dream-A-Fuzzy-Odyssey/dp/0441297269/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Golden Dream: A Fuzzy Odyssey</a>&nbsp;(1982).  Along with Fuzzies and Other People, these three novels constitute three possible outcomes for the Fuzzy "Trilogy" which is itself only part of a larger Future History portrayed by Piper in four other novels:&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Four-Day-Planet/dp/1557429928/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Four-Day Planet</a>&nbsp;(1961),&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Uller-Uprising/dp/160312988X/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Uller Uprising</a>&nbsp;(1952),&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/The-Cosmic-Computer/dp/160312876X/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">The Cosmic Computer</a>&nbsp;(1958), and&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Space-Viking/dp/1603128751/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Space Viking</a>&nbsp;(1962); and several short stories published between 1957 and 1962 and collected in two anthologies, Federation and Empire, edited by John F. Carr.']
Length: 1, dtype: string

That is a superise.

There are not too many words here, but there are a lot of HTML codes in this review. We don't want that.

Let's see what we can do.

In [ ]:
import re as re
text = re.sub('<[^<]+?>', '', str(demo_text))
text = re.sub('&nbsp', '', str(text))
text
Out[ ]:
'\n[\'H. Beam Piper\\\'s Fuzzy novels,;Little Fuzzy;(first published in 1962),;Fuzzy Sapiens;(originally published as;The Other Human Race;in 1964), and;Fuzzies and Other People;(first published in 1984), are perhaps the best treatment ever of the nature of intelligence in science-fiction.  The three novels deal with the assorted legal and political challenges which occur in the aftermath of the discovery of the Fuzzies--small, cute, furry humanoids--by human settlers on the planet Zarathustra.  Part crime drama, part space opera, Piper\\\'s novels remain a joy to read even though many of their early-1960\\\'s technological and cultural accouterments are a bit outdated.\\n\\nInterestingly, the third novel in the Fuzzies series, published posthumously, appeared after the publication of two "authorized" sequels penned by other authors: William Tuning\\\'s;Fuzzy Bones;(1981) and Ardath Mayhar\\\'s;Golden Dream: A Fuzzy Odyssey;(1982).  Along with Fuzzies and Other People, these three novels constitute three possible outcomes for the Fuzzy "Trilogy" which is itself only part of a larger Future History portrayed by Piper in four other novels:;Four-Day Planet;(1961),;Uller Uprising;(1952),;The Cosmic Computer;(1958), and;Space Viking;(1962); and several short stories published between 1957 and 1962 and collected in two anthologies, Federation and Empire, edited by John F. Carr.\']\nLength: 1, dtype: string'

This is much better.

Hence, for all text items. We need to purge the HTML tags if there are any.

In [ ]:
raw_data_copy = raw_data.copy()
In [ ]:
raw_data_copy['reviewText'] = raw_data_copy['reviewText'].str.replace('<[^<]+?>', '')
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
  """Entry point for launching an IPython kernel.
In [ ]:
raw_data_copy['reviewText'] = raw_data_copy['reviewText'].str.replace('&nbsp', '')

Let's run the report function again.

In [ ]:
reports_reviewText_purged = text_item_properties( raw_data_copy.loc[:, ['reviewText']]);
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
In [ ]:
purging_index_2 = find_outliers(reports_reviewText_purged, 'reviewText_num_of_words')
IQR = 113.0 - 21.0 = 92.0
MAX = 251.0
Min is 0
Num of min outliers:  0
Num of max outliers:  108525
Num of negative outliers:  15
Num of the original data set's whole instance 1000000
Rate of purged data/total data 0.108525
In [ ]:
ax = mulitple_function_plots(data=reports_reviewText_purged.drop(purging_index_2), kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7))
Those features will be plotted in  2  rows and  2 columns
Index(['reviewText_Text_length', 'reviewText_num_of_words',
       'reviewText_presence_non_alphanumeric', 'reviewText_stop_words_count'],
      dtype='object')
1 . Finish Rendering : reviewText_Text_length , used 0 millseconds
2 . Finish Rendering : reviewText_num_of_words , used 1 millseconds
3 . Finish Rendering : reviewText_presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : reviewText_stop_words_count , used 3 millseconds

There are still a lot of very long Text length instances.

We need to investigate it further.

In [ ]:
report_data_purged_2 = reports_reviewText_purged.drop(purging_index_2)
In [ ]:
report_data_purged_2[report_data_purged_2['reviewText_Text_length'] > 1750].head(1)
Out[ ]:
reviewText_Text_length reviewText_num_of_words reviewText_presence_non_alphanumeric reviewText_stop_words_count
119954 1824 94 384 19
In [ ]:
max_print_out(True)
outlier_example = raw_data_copy.loc[report_data_purged_2[report_data_purged_2['reviewText_Text_length'] > 1750].head(1).index]
outlier_example.reviewText.values
Out[ ]:
<StringArray>
['Another great Wen Spencer book. A very fun adventure story with a resourceful young hero, determined to protect both his new family and birth family. Jerin Whistler has been raised unconventionally for his world. He reads, writes, knows self-defense, as well as tactics and strategy. His kind heart and bravery stand him in good stead as he faces and overcomes both moral and physical danger. Great read. Recommend to everyone who loves a great story.\n\nHere are some other great reads by Wen Spencer:\n\nhttp://www.amazon.com/Tinker-Elfhome-Book-Wen-Spencer-ebook/dp/B00AP9CJXC/ref=la_B001IQXNE0_1_1?s=books&ie=UTF8&qid=1459800416&sr=1-1\n\nhttp://www.amazon.com/Wolf-Who-Rules-Elfhome-Book-ebook/dp/B00AP91UDC/ref=la_B001IQXNE0_1_3?s=books&ie=UTF8&qid=1459800416&sr=1-3\n\nhttp://www.amazon.com/Elfhome-Wen-Spencer-ebook/dp/B00APADQ0Q/ref=la_B001IQXNE0_1_4?s=books&ie=UTF8&qid=1459800416&sr=1-4\n\nhttp://www.amazon.com/Wood-Sprites-Elfhome-Book-4-ebook/dp/B00MRZ0JNO/ref=la_B001IQXNE0_1_2?s=books&ie=UTF8&qid=1459800416&sr=1-2\n\nhttp://www.amazon.com/Blue-Sky-Elfhome-Wen-Spencer-ebook/dp/B008E9HVDS/ref=la_B001IQXNE0_1_12?s=books&ie=UTF8&qid=1459800416&sr=1-12\n\nhttp://www.amazon.com/Wyvern-Elfhome-Wen-Spencer-ebook/dp/B008EACLYQ/ref=la_B001IQXNE0_1_10?s=books&ie=UTF8&qid=1459800416&sr=1-10\n\nhttp://www.amazon.com/Alien-Taste-Ukiah-Oregon-Book-ebook/dp/B000OIZU9E/ref=la_B001IQXNE0_1_6?s=books&ie=UTF8&qid=1459800416&sr=1-6\n\nhttp://www.amazon.com/Tainted-Trail-Ukiah-Oregon-Book-ebook/dp/B000OIZU94/ref=la_B001IQXNE0_1_8?s=books&ie=UTF8&qid=1459800416&sr=1-8\n\nhttp://www.amazon.com/Bitter-Waters-Ukiah-Oregon-Book-ebook/dp/B000OIZUBW/ref=la_B001IQXNE0_1_9?s=books&ie=UTF8&qid=1459800416&sr=1-9\n\nhttp://www.amazon.com/Dog-Warrior-Ukiah-Oregon-Book-ebook/dp/B000OIZUI0/ref=la_B001IQXNE0_1_11?s=books&ie=UTF8&qid=1459800416&sr=1-11']
Length: 1, dtype: string

Now, we have website links to delete.

In [ ]:
demo_text = str(outlier_example.reviewText.values)
import re as re
text = re.sub('http\S+', '', str(demo_text))
text
Out[ ]:
"<StringArray>\n['Another great Wen Spencer book. A very fun adventure story with a resourceful young hero, determined to protect both his new family and birth family. Jerin Whistler has been raised unconventionally for his world. He reads, writes, knows self-defense, as well as tactics and strategy. His kind heart and bravery stand him in good stead as he faces and overcomes both moral and physical danger. Great read. Recommend to everyone who loves a great story.\\n\\nHere are some other great reads by Wen Spencer:\\n\\n\nLength: 1, dtype: stringsdfsdfsdsfdsfsfsfsfdsfsfsfs"

Good enough.

2. Delete HTML tage and URL from our dataset

Now, it's time to generate a function to do so.

In [ ]:
#---------------clean_useless_information---------------

def clean_useless_information(data_df, columns = ['reviewText']):
  data = data_df.copy()
  for i in range(len(columns)):
    # clean html tag
    data[columns[i]] = data[(columns[i])].str.replace('<[^<]+?>', '')
    # clean &nbsp
    data[(columns[i])] = data[(columns[i])].str.replace('&nbsp', '')
    # clean http URL
    data[(columns[i])] = data[(columns[i])].str.replace('http\S+', '')
    # clean line breaker
    data[(columns[i])] = data[(columns[i])].str.replace('\n', '')
    return data
In [ ]:
raw_data_clean = clean_useless_information(raw_data)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:7: FutureWarning: The default value of regex will change from True to False in a future version.
  import sys
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:11: FutureWarning: The default value of regex will change from True to False in a future version.
  # This is added back by InteractiveShellApp.init_path()

Let's find out whether we did a right thing.

Original problematic instance

In [ ]:
# original dataset with problematic instance 
# HTML TAG instance
raw_data.iloc[156379,7]
Out[ ]:
'H. Beam Piper\'s Fuzzy novels,&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Little-Fuzzy/dp/159818797X/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Little Fuzzy</a>&nbsp;(first published in 1962),&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Fuzzy-Sapiens/dp/B000EG6BA8/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Fuzzy Sapiens</a>&nbsp;(originally published as&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/The-Other-Human-Race/dp/B000H0O5CC/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">The Other Human Race</a>&nbsp;in 1964), and&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Fuzzies-and-Other-People/dp/B000ENMVTQ/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Fuzzies and Other People</a>&nbsp;(first published in 1984), are perhaps the best treatment ever of the nature of intelligence in science-fiction.  The three novels deal with the assorted legal and political challenges which occur in the aftermath of the discovery of the Fuzzies--small, cute, furry humanoids--by human settlers on the planet Zarathustra.  Part crime drama, part space opera, Piper\'s novels remain a joy to read even though many of their early-1960\'s technological and cultural accouterments are a bit outdated.\n\nInterestingly, the third novel in the Fuzzies series, published posthumously, appeared after the publication of two "authorized" sequels penned by other authors: William Tuning\'s&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Fuzzy-Bones/dp/0441261825/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Fuzzy Bones</a>&nbsp;(1981) and Ardath Mayhar\'s&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Golden-Dream-A-Fuzzy-Odyssey/dp/0441297269/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Golden Dream: A Fuzzy Odyssey</a>&nbsp;(1982).  Along with Fuzzies and Other People, these three novels constitute three possible outcomes for the Fuzzy "Trilogy" which is itself only part of a larger Future History portrayed by Piper in four other novels:&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Four-Day-Planet/dp/1557429928/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Four-Day Planet</a>&nbsp;(1961),&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Uller-Uprising/dp/160312988X/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Uller Uprising</a>&nbsp;(1952),&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/The-Cosmic-Computer/dp/160312876X/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">The Cosmic Computer</a>&nbsp;(1958), and&nbsp;<a data-hook="product-link-linked" class="a-link-normal" href="/Space-Viking/dp/1603128751/ref=cm_cr_arp_d_rvw_txt?ie=UTF8">Space Viking</a>&nbsp;(1962); and several short stories published between 1957 and 1962 and collected in two anthologies, Federation and Empire, edited by John F. Carr.'

Cleaned instance

In [ ]:
raw_data_clean.iloc[156379,7]
Out[ ]:
'H. Beam Piper\'s Fuzzy novels,;Little Fuzzy;(first published in 1962),;Fuzzy Sapiens;(originally published as;The Other Human Race;in 1964), and;Fuzzies and Other People;(first published in 1984), are perhaps the best treatment ever of the nature of intelligence in science-fiction.  The three novels deal with the assorted legal and political challenges which occur in the aftermath of the discovery of the Fuzzies--small, cute, furry humanoids--by human settlers on the planet Zarathustra.  Part crime drama, part space opera, Piper\'s novels remain a joy to read even though many of their early-1960\'s technological and cultural accouterments are a bit outdated.\n\nInterestingly, the third novel in the Fuzzies series, published posthumously, appeared after the publication of two "authorized" sequels penned by other authors: William Tuning\'s;Fuzzy Bones;(1981) and Ardath Mayhar\'s;Golden Dream: A Fuzzy Odyssey;(1982).  Along with Fuzzies and Other People, these three novels constitute three possible outcomes for the Fuzzy "Trilogy" which is itself only part of a larger Future History portrayed by Piper in four other novels:;Four-Day Planet;(1961),;Uller Uprising;(1952),;The Cosmic Computer;(1958), and;Space Viking;(1962); and several short stories published between 1957 and 1962 and collected in two anthologies, Federation and Empire, edited by John F. Carr.'

NO HTML TAG and nbsp. GOOD.

Original problematic instance

In [ ]:
# original dataset with problematic instance 
# Long URL
raw_data.iloc[119954,7]
Out[ ]:
'Another great Wen Spencer book. A very fun adventure story with a resourceful young hero, determined to protect both his new family and birth family. Jerin Whistler has been raised unconventionally for his world. He reads, writes, knows self-defense, as well as tactics and strategy. His kind heart and bravery stand him in good stead as he faces and overcomes both moral and physical danger. Great read. Recommend to everyone who loves a great story.\n\nHere are some other great reads by Wen Spencer:\n\nhttp://www.amazon.com/Tinker-Elfhome-Book-Wen-Spencer-ebook/dp/B00AP9CJXC/ref=la_B001IQXNE0_1_1?s=books&ie=UTF8&qid=1459800416&sr=1-1\n\nhttp://www.amazon.com/Wolf-Who-Rules-Elfhome-Book-ebook/dp/B00AP91UDC/ref=la_B001IQXNE0_1_3?s=books&ie=UTF8&qid=1459800416&sr=1-3\n\nhttp://www.amazon.com/Elfhome-Wen-Spencer-ebook/dp/B00APADQ0Q/ref=la_B001IQXNE0_1_4?s=books&ie=UTF8&qid=1459800416&sr=1-4\n\nhttp://www.amazon.com/Wood-Sprites-Elfhome-Book-4-ebook/dp/B00MRZ0JNO/ref=la_B001IQXNE0_1_2?s=books&ie=UTF8&qid=1459800416&sr=1-2\n\nhttp://www.amazon.com/Blue-Sky-Elfhome-Wen-Spencer-ebook/dp/B008E9HVDS/ref=la_B001IQXNE0_1_12?s=books&ie=UTF8&qid=1459800416&sr=1-12\n\nhttp://www.amazon.com/Wyvern-Elfhome-Wen-Spencer-ebook/dp/B008EACLYQ/ref=la_B001IQXNE0_1_10?s=books&ie=UTF8&qid=1459800416&sr=1-10\n\nhttp://www.amazon.com/Alien-Taste-Ukiah-Oregon-Book-ebook/dp/B000OIZU9E/ref=la_B001IQXNE0_1_6?s=books&ie=UTF8&qid=1459800416&sr=1-6\n\nhttp://www.amazon.com/Tainted-Trail-Ukiah-Oregon-Book-ebook/dp/B000OIZU94/ref=la_B001IQXNE0_1_8?s=books&ie=UTF8&qid=1459800416&sr=1-8\n\nhttp://www.amazon.com/Bitter-Waters-Ukiah-Oregon-Book-ebook/dp/B000OIZUBW/ref=la_B001IQXNE0_1_9?s=books&ie=UTF8&qid=1459800416&sr=1-9\n\nhttp://www.amazon.com/Dog-Warrior-Ukiah-Oregon-Book-ebook/dp/B000OIZUI0/ref=la_B001IQXNE0_1_11?s=books&ie=UTF8&qid=1459800416&sr=1-11'

Cleaned instance

In [ ]:
raw_data_clean.iloc[119954,7]
Out[ ]:
'Another great Wen Spencer book. A very fun adventure story with a resourceful young hero, determined to protect both his new family and birth family. Jerin Whistler has been raised unconventionally for his world. He reads, writes, knows self-defense, as well as tactics and strategy. His kind heart and bravery stand him in good stead as he faces and overcomes both moral and physical danger. Great read. Recommend to everyone who loves a great story.Here are some other great reads by Wen Spencer:'

NO URL anymore. Good!

3. Now we generate the reports with boxplot

Note, such a bad idea to add feature name into result column name.
I don't have time to rerun codes from 1.a.2 Text item Properties.
So we changed our function from now on.

In [ ]:
#--------------text_item_properties---------------#
'''We want to save all the results to a new dataframe'''
def text_item_properties(data):
  result = pd.DataFrame()
  data = data.copy()
  # we just fill na with 0 here. Without doing so, there will be an error
  data = data.fillna('0')
  for i in range(len(data.columns)):
    # get character length
    result['Text_length'] = data[str(data.columns[i])].str.len() 
    # get number of words
    result['num_of_words'] = data[str(data.columns[i])].str.split().str.len()
    # get non alphanumeric number
    result['presence_non_alphanumeric'] = data[str(data.columns[i])].str.replace('[a-zA-Z0-9]', '').str.len()
    # get stop words account
    result['stop_words_count'] = data[str(data.columns[i])].str.split().apply(lambda x: len(set(x) & stop_words)) 
  return result 
In [ ]:
#---------------clean_useless_information---------------
def show_purged_reports(data_df, parameter = ['reviewText'], output_type = 'num_of_words'):
  data = data_df.copy() # get the copy
  # get our reports
  reports = text_item_properties( data.loc[:, parameter]);
  # find outliers
  index = find_outliers(reports, output_type);
  # plot the results
  ax = mulitple_function_plots(data=reports.drop(index), kde_type = False, plot_type="histogram",data_type="number", fig_size=(15,7),tight_layout=False)
  ax = mulitple_function_plots(data=reports.drop(index), kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7) , tight_layout=False);
  return reports, index
In [ ]:
reports, index = show_purged_reports(raw_data_clean)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:14: FutureWarning: The default value of regex will change from True to False in a future version.
  
IQR = 112.0 - 21.0 = 91.0
MAX = 248.5
Min is 0
Num of min outliers:  0
Num of max outliers:  108535
Num of negative outliers:  27
Num of the original data set's whole instance 1000000
Rate of purged data/total data 0.108535
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 0 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : stop_words_count , used 0 millseconds
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 1 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : stop_words_count , used 2 millseconds

We still have outliers, but let's see if it's reasonable.

In [ ]:
max_print_out(True)
outlier_example = raw_data_clean.loc[reports[reports['reviewText_Text_length'] > 1750].head(1).index]
outlier_example.reviewText.values
Out[ ]:
<StringArray>
["This book is a stand alone read as some have mentioned in their reviews. To really understand the whole dynamic behind the characters; you should really read the other books in the series.  I wondered how Dane was going to tell Raven' s story because there was just such a strong energy bouncing off Raven. Her feelings were raw in the Brown Felt books and the Dinner Party Girls. I felt like Lauren really needed to get into Raven's Psyche and give Raven's voice as well as her emotions a chance.Lauren did it.  She wrote Raven as the edgy bitch some of us had become curios about.  Raven had been independent and alone for the majority of  her life. Letting people in her life felt wrong and placed her in a very vulnerable position.  She had never felt LOVED and cherished until she started a relationship with her bosses sister. Erin befriended Raven immediately and it was the relationship that began preparing Raven for the real deal, Jonah Warner. Jonah Warner, Levi' s brother, could not and would not be denied. It took Jonah patiently peeling Raven' s defenses away like a blossoming onion before she finally completely submitted.Their relationship finally opened up as Raven starts letting go and trusting that Jonah is definitely not a hot fling.Raven eventually is forced to confront her demons and luckily she doesn't have to do it alone. Answers to questions she had been struggling to know finally were resolved.  Anger she was consumed with began to dissolve, and the closure she needed concerning her family finally happened. She was able to freely fall in love.Dane never fails to let her fans look into the lives of the other characters. Once again we get to reconnect  with the rest of the family. The Brown Family and the Dinner Party Girls all come together to take care of Raven."]
Length: 1, dtype: string

Quite reasonbale. Then we are done, here. No further investigation of this data quality issue.

2. Summary

We rerun the above codes to see if there is a problem in summary.

In [ ]:
reports, index = show_purged_reports(raw_data_clean, parameter=['summary'])
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:14: FutureWarning: The default value of regex will change from True to False in a future version.
  
IQR = 6.0 - 2.0 = 4.0
MAX = 12.0
Min is 0
Num of min outliers:  0
Num of max outliers:  37431
Num of negative outliers:  1
Num of the original data set's whole instance 1000000
Rate of purged data/total data 0.037431
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 0 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : stop_words_count , used 0 millseconds
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 1 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : stop_words_count , used 2 millseconds
In [ ]:
max_print_out(True)
outlier_example = raw_data_clean.loc[reports[reports['Text_length'] > 120].head(1).index]
outlier_example.reviewText.values
Out[ ]:
<StringArray>
['Paid for two and only ordered one. I have been trying to return the 2nd one and got no response so I will have to keep checking to make sure I did not get charged again for two books but it was a great purchase!']
Length: 1, dtype: string

Well, the longest outlier in summary is just a normal sentence.

We can called it a day for the data quality issue of text items.

3. NaN values

In [ ]:
raw_data_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 12 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   overall         1000000 non-null  Int64  
 1   verified        1000000 non-null  boolean
 2   reviewTime      1000000 non-null  string 
 3   reviewerID      1000000 non-null  string 
 4   asin            1000000 non-null  string 
 5   style           994508 non-null   string 
 6   reviewerName    999938 non-null   string 
 7   reviewText      999867 non-null   string 
 8   summary         999859 non-null   string 
 9   unixReviewTime  1000000 non-null  Int64  
 10  vote            217191 non-null   object 
 11  image           1559 non-null     string 
dtypes: Int64(2), boolean(1), object(1), string(8)
memory usage: 87.7+ MB
In [ ]:
raw_data_clean[raw_data_clean['reviewText'].isna()].head(5)
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime vote image
1521 5 False 07 19, 2017 AZWSL7MRCSAQF 0399159347 {'Format:': ' Hardcover'} Katandra Jackson Nunnally <NA> It Matters 1500422400 NaN ['https://images-na.ssl-images-amazon.com/imag...
2924 5 True 02 24, 2017 A3EME7G0VAOKV2 0451226119 {'Format:': ' Paperback'} Yolanda Jones <NA> Five Stars 1487894400 NaN <NA>
8053 5 True 05 28, 2015 A207U7VN8R2JWJ 0380791714 {'Format:': ' Paperback'} abel osuna <NA> Five Stars 1432771200 NaN <NA>
11051 5 True 04 30, 2016 A1SCSXMXJVMVCU 0451475518 {'Format:': ' Kindle Edition'} Kristen <NA> Five Stars 1461974400 NaN <NA>
26905 3 True 04 15, 2016 A3JEFNA3EF5ZIW 0486270602 {'Format:': ' Paperback'} Raven <NA> Three Stars 1460678400 NaN <NA>

Without reviewText, it's very hard to predict the overall score. We delete all instnace with NaN reviewText first.

In [ ]:
#--------------purge_NaN-------------------
def purge_NaN(data_df):
  data = data_df.copy()
  data = data.drop(data[data['reviewText'].isna()].index)
  return data
In [ ]:
new_data_without_nan = purge_NaN(raw_data_clean)
In [ ]:
new_data_without_nan.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999867 entries, 0 to 999999
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   overall         999867 non-null  Int64  
 1   verified        999867 non-null  boolean
 2   reviewTime      999867 non-null  string 
 3   reviewerID      999867 non-null  string 
 4   asin            999867 non-null  string 
 5   style           994375 non-null  string 
 6   reviewerName    999805 non-null  string 
 7   reviewText      999867 non-null  string 
 8   summary         999733 non-null  string 
 9   unixReviewTime  999867 non-null  Int64  
 10  vote            217187 non-null  object 
 11  image           1550 non-null    string 
dtypes: Int64(2), boolean(1), object(1), string(8)
memory usage: 95.4+ MB

We take a look at style's NaN looks like.

In [ ]:
raw_data_clean[raw_data_clean['style'].isna()].head(5)
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime vote image
43 5 False 07 16, 2014 A32M1YZPIX7UK 0385489129 <NA> Jeannette Speck Thanks Five Stars 1405468800 NaN <NA>
108 4 False 04 4, 2016 A2OJW07GQRNJUT 039304050X <NA> Steven H Propp Author and former prosecutor Vincent Bugliosi ... THE MANSON PROSECUTOR TURNED WRITER LOOKS AT T... 1459728000 NaN <NA>
152 5 True 08 25, 2015 A2HIDWUPXXZAOR 0394551907 <NA> Pat I only recently heard about this Dr. Seuss boo... I only recently heard about this Dr. Seuss boo... 1440460800 NaN <NA>
256 5 False 08 8, 2010 AHD101501WCN1 0445083808 <NA> Shalom Freedman Cornelius Ryan does something unusual in this ... Outstanding description of the Normandy Invasion 1281225600 3 <NA>
351 5 True 08 18, 2014 AV1OM7Z698LL5 0451468740 <NA> Kathleen Freeman Truly an engaging story about an amazing event... The end of Osama 1408320000 2 <NA>

And the normal style instances

In [ ]:
raw_data_clean['style'].head(5)
Out[ ]:
0    {'Format:': ' Kindle Edition'}
1         {'Format:': ' Hardcover'}
2    {'Format:': ' Kindle Edition'}
3         {'Format:': ' Paperback'}
4    {'Format:': ' Kindle Edition'}
Name: style, dtype: string
In [ ]:
raw_data_clean['style'].value_counts().head()
Out[ ]:
{'Format:': ' Kindle Edition'}           485256
{'Format:': ' Paperback'}                205399
{'Format:': ' Hardcover'}                181883
{'Format:': ' Mass Market Paperback'}     95746
{'Format:': ' Board book'}                 9241
Name: style, dtype: Int64
In [ ]:
len(raw_data_clean[raw_data_clean['style'].isna()])
Out[ ]:
5492

This feature is not that important. And we can just add one kind of book style on the NaN values.

Or 5492 is really not too much instances, we can just drop it.

Let's see how many NaN with summary.

In [ ]:
len(raw_data_clean[raw_data_clean['summary'].isna()])
Out[ ]:
141
In [ ]:
raw_data_clean[raw_data_clean['summary'].isna()].head(5)
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime vote image
5752 1 False 01 17, 2018 A35YSGG9NGX2Y8 0385543026 {'Format:': ' Hardcover'} ReadsLots Starts off nicely: great caper, swiping manus... <NA> 1516147200 NaN <NA>
6007 5 False 06 23, 2016 A3B3SJL6Y9GAPU 0375990690 {'Format:': ' Kindle Edition'} mjc I loved this book! This book has been amazing,... <NA> 1466640000 2 <NA>
11853 5 True 01 26, 2017 ARCS9S0H1Y7VB 0425270696 {'Format:': ' Kindle Edition'} Sarah Mack I love this book, and this series. The authors... <NA> 1485388800 NaN <NA>
23638 5 True 01 11, 2017 A2HDK6XBOV1BI1 0373284241 {'Format:': ' Kindle Edition'} Bailey Have enjoyed reading all the books in this ser... <NA> 1484092800 NaN <NA>
32116 5 True 09 5, 2016 ACV8DKJK8R328 0425263223 {'Format:': ' Kindle Edition'} Angie Love the series. If your missing Sookie's worl... <NA> 1473033600 NaN <NA>

Since we have almost 1 million instances, we can just drop those 141 instances.

1.2.2 Data quality Plan

In [ ]:
raw_data_clean.columns
Out[ ]:
Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote',
       'image'],
      dtype='object')

About NaN values. We drop all instances with NaN values.

Continuous Feature Data Quality Issue Potential Handling Strategies
overall The ratio of 5 different scores are different Need to stratified sample the dataset before training.
unixReviewTime a time series value but in Int convert to time series and then put into days
verified boolean value chagne to int

Categorical Feature Data Quality Issue Potential Handling Strategies
reviewTime a time series value but in string convert to time series and then put into days
reviewerID Not much useful Drop
asin ID number for books Can't get the book name so far, have to drop it
style has useless non-aplhanumeric characters need to purge those characters
reviewerName not very useful Drop the column
reviewText Has outliers/html tag/URL/useless symbols Purge outliers and delete useless information
summary No problem for now No further investigation
vote TOO many NaN values Drop column
image Too many NaN values Drop column

1.3. Preprocess your data according to the data quality plan.

1. Drop vote and image and NaN instances

In [ ]:
#--------------purge_NaN-------------------
def purge_NaN(data_df):
  data = data_df.copy()
  # drop vote and image
  data = data.drop(['vote','image'], axis = 1)
  # drop NaN values
  for i in range(len(data.columns)):
    data = data.drop(data[data[str(data.columns[i])].isna()].index)
  return data
In [ ]:
raw_data_copy = raw_data.copy()
raw_data_copy = purge_NaN(raw_data_copy)
In [ ]:
raw_data_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 994181 entries, 0 to 999999
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   overall         994181 non-null  Int64  
 1   verified        994181 non-null  boolean
 2   reviewTime      994181 non-null  string 
 3   reviewerID      994181 non-null  string 
 4   asin            994181 non-null  string 
 5   style           994181 non-null  string 
 6   reviewerName    994181 non-null  string 
 7   reviewText      994181 non-null  string 
 8   summary         994181 non-null  string 
 9   unixReviewTime  994181 non-null  Int64  
dtypes: Int64(2), boolean(1), string(7)
memory usage: 111.9 MB

2. Purge outliers

We only need to use review Text to purge outliers

In [ ]:
#---------------clean_useless_information---------------
def purge_outliers(data_df, parameter = 'reviewText', output_type = 'num_of_words'):
  data = data_df.copy() # get the copy
  # find outliers
  result = pd.DataFrame()
  result[output_type] = data[parameter].str.split().str.len()
  index = find_outliers(result, output_type);
  # plot the results
  return data.drop(index)
In [ ]:
raw_data_copy = purge_outliers(raw_data_copy)
IQR = 112.0 - 21.0 = 91.0
MAX = 248.5
Min is 0
Num of min outliers:  0
Num of max outliers:  109577
Num of negative outliers:  0
Num of the original data set's whole instance 994181
Rate of purged data/total data 0.11021836064056746
In [ ]:
raw_data_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884604 entries, 0 to 999999
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   overall         884604 non-null  Int64  
 1   verified        884604 non-null  boolean
 2   reviewTime      884604 non-null  string 
 3   reviewerID      884604 non-null  string 
 4   asin            884604 non-null  string 
 5   style           884604 non-null  string 
 6   reviewerName    884604 non-null  string 
 7   reviewText      884604 non-null  string 
 8   summary         884604 non-null  string 
 9   unixReviewTime  884604 non-null  Int64  
dtypes: Int64(2), boolean(1), string(7)
memory usage: 70.9 MB

3. Change boolean to int

In [ ]:
raw_data_copy["verified"] = raw_data_copy["verified"].astype(int)
In [ ]:
raw_data_copy.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884604 entries, 0 to 999999
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   overall         884604 non-null  Int64 
 1   verified        884604 non-null  int64 
 2   reviewTime      884604 non-null  string
 3   reviewerID      884604 non-null  string
 4   asin            884604 non-null  string
 5   style           884604 non-null  string
 6   reviewerName    884604 non-null  string
 7   reviewText      884604 non-null  string
 8   summary         884604 non-null  string
 9   unixReviewTime  884604 non-null  Int64 
dtypes: Int64(2), int64(1), string(7)
memory usage: 75.9 MB

4. clean non-alphanumeric character in each feature.

First we take a look at the style with long length

In [ ]:
# instances with long style length
raw_data_copy[raw_data_copy['style'].str.len() > 30].head()
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime
19 3 0 08 28, 2002 A3UHCUZ5H4N0TH 0399138994 {'Format:': ' Mass Market Paperback'} Darren Jacks Didn't King write something along these lines;... Average... 1030492800
61 3 1 06 22, 2014 A1TCJV0HALJE69 037543075X {'Format:': ' Mass Market Paperback'} Chantelle This book was fine. It is not The Notebook, De... Just Okay 1403395200
77 3 0 02 1, 2005 A1DMOR9TUO13I1 0394575288 {'Format:': ' Mass Market Paperback'} Don M Howard I liked Pete Hamill's "Loving Women" and found... Good story, but anachronisms abound 1107216000
88 4 0 01 30, 2017 AQH1ODDV4HKLI 0425258947 {'Format:': ' Mass Market Paperback'} des So although I love this series, I wasn't a fan... So although I love this series 1485734400
91 5 0 03 10, 2014 A2WP2OBWXIN78C 0451418336 {'Format:': ' Mass Market Paperback'} Sara Gerhold Catherine Anderson has done it again! She brin... Another hit by Catherine Anderson! 1394409600

We can see in style, we need to remove 'Format' and Punctuation.

In [ ]:
raw_data_copy['style'] = raw_data_copy['style'].str.replace('[^\w\s]+', '')
raw_data_copy['style'] = raw_data_copy['style'].str.replace('Format', '')
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
  """Entry point for launching an IPython kernel.
In [ ]:
raw_data_copy.head()
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime
0 2 0 12 4, 2015 A273QRPDN6IQC8 0446676101 Kindle Edition Rub Chicken Book starts out with some really interesting i... with some really interesting ideas and gets c... 1449187200
1 5 1 04 24, 2016 A31Q39MDPVBTSX 0451473019 Hardcover BluegrassAnne was a gift for someone he loved it he loved 1461456000
2 4 1 10 30, 2014 A353XVWAOOUCQS 0385352107 Kindle Edition Reader in the Pacific I have read a number of Murakami novels and th... Hollow Man 1414627200
4 5 0 08 15, 2015 A2XS90TMQ26YYH 0399146695 Kindle Edition J Williams Great story Highly recommended Five Stars 1439596800
5 3 1 04 5, 2014 A1W2NS6QC29RZM 0385519311 Hardcover Noraxpat I passed this book on to someone who was more ... OK FOR INTRO TO PERSONAL FINANCE 1396656000

Nice that is what we want. Then we delete any punctuation in other text features.

In [ ]:
raw_data_copy['reviewerName'] = raw_data_copy['reviewerName'].str.replace('[^\w\s]+', '')
raw_data_copy['reviewText'] = raw_data_copy['reviewText'].str.replace('[^\w\s]+', '')
raw_data_copy['summary'] = raw_data_copy['summary'].str.replace('[^\w\s]+', '')
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
  """Entry point for launching an IPython kernel.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: FutureWarning: The default value of regex will change from True to False in a future version.
  
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3: FutureWarning: The default value of regex will change from True to False in a future version.
  This is separate from the ipykernel package so we can avoid doing imports until
In [ ]:
raw_data_copy.head()
Out[ ]:
overall verified reviewTime reviewerID asin style reviewerName reviewText summary unixReviewTime
0 2 0 12 4, 2015 A273QRPDN6IQC8 0446676101 Kindle Edition Rub Chicken Book starts out with some really interesting i... with some really interesting ideas and gets c... 1449187200
1 5 1 04 24, 2016 A31Q39MDPVBTSX 0451473019 Hardcover BluegrassAnne was a gift for someone he loved it he loved 1461456000
2 4 1 10 30, 2014 A353XVWAOOUCQS 0385352107 Kindle Edition Reader in the Pacific I have read a number of Murakami novels and th... Hollow Man 1414627200
4 5 0 08 15, 2015 A2XS90TMQ26YYH 0399146695 Kindle Edition J Williams Great story Highly recommended Five Stars 1439596800
5 3 1 04 5, 2014 A1W2NS6QC29RZM 0385519311 Hardcover Noraxpat I passed this book on to someone who was more ... OK FOR INTRO TO PERSONAL FINANCE 1396656000

Let's run our reports function to see if there is any punctuation

In [ ]:
reports = show_reports(raw_data_copy, parameter=['summary'])
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 0 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : stop_words_count , used 0 millseconds
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 1 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 2 millseconds
4 . Finish Rendering : stop_words_count , used 3 millseconds

We can see there is no non aplhanumeric value anymore.

That is all we need. The useless features, we will delete in the transformer.

5. Now Let's write our transformer.

In [ ]:
#------------- main transformer ---------------------
# Class for attribute transformer
# import important libray
from sklearn.base import BaseEstimator, TransformerMixin

class combined_attribute_adder_and_cleaner(BaseEstimator, TransformerMixin):
    '''data clean transfomer class'''
    
    def __init__(self, data_cleaner = True, servies_remainer = False, normalization = True): # no *args or **kargs
        # we need to set extra var to ensure do we need to purge the dataset. 
        # In my following experments, sometimes we don't need to do so. 
        self.data_cleaner = data_cleaner
        self.servies_remainer = servies_remainer
        self.normalization = normalization

    def fit(self, X, y=None):
        return self # nothing else to do

    def transform(self, data_df):
        # we first copy the data from our dataset.
        # operate on original data set sometimes is dangerous.
        X = data_df.copy()

        #0. drop NaN values
        # drop vote and image
        X = X.drop(['vote','image'], axis = 1)
        # drop NaN values
        for i in range(len(X.columns)):
          X = X.drop(X[X[str(X.columns[i])].isna()].index)

        # 1. First we change the feature verified with to integer
        X["verified"] = X["verified"].astype(int)

        # 2. purge outliers
        X = purge_outliers(X)

        # 3. drop all useless features and categorical features we alreayd transfered
        X = X.drop(['reviewerID','reviewTime', 'asin', 'unixReviewTime'],axis=1) 

        # 4. delete HTML tag and other useless characters
        X = clean_useless_information(X)

        # 5. clean non alphanumeric data
        X['style'] = X['style'].str.replace('[^\w\s]+', '')
        X['style'] = X['style'].str.replace('Format', '')
        X['reviewerName'] = X['reviewerName'].str.replace('[^\w\s]+', '')
        X['reviewText'] = X['reviewText'].str.replace('[^\w\s]+', '')
        X['summary'] = X['summary'].str.replace('[^\w\s]+', '')

        # we put our target value at the end
        target = X.pop('overall')
        X['score'] = target


        return X
In [ ]:
raw_data_copy = raw_data.copy()
attr_adder_and_cleaner = combined_attribute_adder_and_cleaner()
purged_data = attr_adder_and_cleaner.transform(raw_data_copy)
IQR = 112.0 - 21.0 = 91.0
MAX = 248.5
Min is 0
Num of min outliers:  0
Num of max outliers:  109577
Num of negative outliers:  0
Num of the original data set's whole instance 994181
Rate of purged data/total data 0.11021836064056746
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:26: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:30: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:44: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:46: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:47: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:48: FutureWarning: The default value of regex will change from True to False in a future version.
In [ ]:
purged_data.head()
Out[ ]:
verified style reviewerName reviewText summary score
0 0 Kindle Edition Rub Chicken Book starts out with some really interesting i... with some really interesting ideas and gets c... 2
1 1 Hardcover BluegrassAnne was a gift for someone he loved it he loved 5
2 1 Kindle Edition Reader in the Pacific I have read a number of Murakami novels and th... Hollow Man 4
4 0 Kindle Edition J Williams Great story Highly recommended Five Stars 5
5 1 Hardcover Noraxpat I passed this book on to someone who was more ... OK FOR INTRO TO PERSONAL FINANCE 3
In [ ]:
purged_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884604 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   verified      884604 non-null  int64 
 1   style         884604 non-null  string
 2   reviewerName  884604 non-null  string
 3   reviewText    884604 non-null  string
 4   summary       884604 non-null  string
 5   score         884604 non-null  Int64 
dtypes: Int64(1), int64(1), string(4)
memory usage: 48.1 MB

Create a pipe line.

In [ ]:
#############################PIPE LINE###################################################

# Now we build a transformer to get all the above steps
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# convert_pipeline is for create a whole pipeline but remain the dataFrame structure
convert_pipeline = Pipeline([
        ('attribs_adder_cleaner', combined_attribute_adder_and_cleaner(data_cleaner=True)),
    ])

Test our pipeline.

In [ ]:
converted_data = convert_pipeline.fit_transform(raw_data.copy())
IQR = 112.0 - 21.0 = 91.0
MAX = 248.5
Min is 0
Num of min outliers:  0
Num of max outliers:  109577
Num of negative outliers:  0
Num of the original data set's whole instance 994181
Rate of purged data/total data 0.11021836064056746
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:7: FutureWarning: The default value of regex will change from True to False in a future version.
  import sys
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:11: FutureWarning: The default value of regex will change from True to False in a future version.
  # This is added back by InteractiveShellApp.init_path()
In [ ]:
converted_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884604 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   verified      884604 non-null  int64 
 1   style         884604 non-null  string
 2   reviewerName  884604 non-null  string
 3   reviewText    884604 non-null  string
 4   summary       884604 non-null  string
 5   score         884604 non-null  Int64 
dtypes: Int64(1), int64(1), string(4)
memory usage: 48.1 MB

Let's draw the report.

In [ ]:
#---------------clean_useless_information---------------
def show_reports(data_df, parameter = ['reviewText'], output_type = 'num_of_words'):
  data = data_df.copy() # get the copy
  # get our reports
  reports = text_item_properties( data.loc[:, parameter]);
  # plot the results
  ax = mulitple_function_plots(data=reports, kde_type = False, plot_type="histogram",data_type="number", fig_size=(15,7),tight_layout=False)
  ax = mulitple_function_plots(data=reports, kde_type= False , plot_type="boxplot",data_type="number", fig_size=(15,7) , tight_layout=False);
  return reports

Draw graph on reviewText

In [ ]:
reports = show_reports(converted_data)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 0 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : stop_words_count , used 1 millseconds
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 2 millseconds
2 . Finish Rendering : num_of_words , used 3 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 4 millseconds
4 . Finish Rendering : stop_words_count , used 5 millseconds

Draw graph on summary.

In [ ]:
reports = show_reports(converted_data,parameter=['summary'])
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 0 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : stop_words_count , used 0 millseconds
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 1 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 1 millseconds
4 . Finish Rendering : stop_words_count , used 2 millseconds

Draw graph on style.

In [ ]:
reports = show_reports(converted_data,parameter=['style'])
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: FutureWarning: The default value of regex will change from True to False in a future version.
  app.launch_new_instance()
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 0 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 0 millseconds
4 . Finish Rendering : stop_words_count , used 0 millseconds
Those features will be plotted in  2  rows and  2 columns
Index(['Text_length', 'num_of_words', 'presence_non_alphanumeric',
       'stop_words_count'],
      dtype='object')
1 . Finish Rendering : Text_length , used 0 millseconds
2 . Finish Rendering : num_of_words , used 1 millseconds
3 . Finish Rendering : presence_non_alphanumeric , used 1 millseconds
4 . Finish Rendering : stop_words_count , used 2 millseconds

That is what we want.

1.4. Answer the following questions:

Before we doing those questions, we need to alter our dataset a bit.

1.4.0.1 Data Preparation

In [ ]:
converted_data = convert_pipeline.fit_transform(raw_data.copy())
IQR = 112.0 - 21.0 = 91.0
MAX = 248.5
Min is 0
Num of min outliers:  0
Num of max outliers:  109577
Num of negative outliers:  0
Num of the original data set's whole instance 994181
Rate of purged data/total data 0.11021836064056746
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:101: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:105: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:432: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:434: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:435: FutureWarning: The default value of regex will change from True to False in a future version.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:436: FutureWarning: The default value of regex will change from True to False in a future version.
In [ ]:
converted_data.to_csv('purged_data.csv')
In [ ]:
converted_data.columns
Out[ ]:
Index(['verified', 'style', 'reviewerName', 'reviewText', 'summary', 'score'], dtype='object')
In [ ]:
converted_data.head()
Out[ ]:
verified style reviewerName reviewText summary score
0 0 Kindle Edition Rub Chicken Book starts out with some really interesting i... with some really interesting ideas and gets c... 2
1 1 Hardcover BluegrassAnne was a gift for someone he loved it he loved 5
2 1 Kindle Edition Reader in the Pacific I have read a number of Murakami novels and th... Hollow Man 4
4 0 Kindle Edition J Williams Great story Highly recommended Five Stars 5
5 1 Hardcover Noraxpat I passed this book on to someone who was more ... OK FOR INTRO TO PERSONAL FINANCE 3
In [ ]:
converted_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884604 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   verified      884604 non-null  int64 
 1   style         884604 non-null  string
 2   reviewerName  884604 non-null  string
 3   reviewText    884604 non-null  string
 4   summary       884604 non-null  string
 5   score         884604 non-null  Int64 
dtypes: Int64(1), int64(1), string(4)
memory usage: 48.1 MB
In [ ]:
converted_data = pd.read_csv('/content/drive/MyDrive/A3/purged_data.csv')
converted_data = converted_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it

Before we doing any analysis, we first need to remove the stop words.

In [ ]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
In [ ]:
len(stop_words)
Out[ ]:
179

We can see there are 179 stop words from nltk.corpus

In [ ]:
# print first 5
stop_words[:20]
Out[ ]:
['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

We will first transfer all charater to lower case and remove the stop words.

In [ ]:
#--------------remove_stop_words-------------
def remove_stop_words(data, stop_words):
  feature = data.select_dtypes(exclude="number").columns
  for i in range(len(feature)):
      print("Now it's removing stop words from ", feature[i])
      # remove stop words
      # first change all character to lower case
      data[feature[i]] = data[feature[i]].str.lower()
      data[feature[i]] = data[feature[i]].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
  return data 
In [ ]:
converted_data_no_stop_words = remove_stop_words(converted_data,stop_words)
Now it's removing stop words from  style
Now it's removing stop words from  reviewerName
Now it's removing stop words from  reviewText
Now it's removing stop words from  summary
In [ ]:
converted_data_copy = converted_data_no_stop_words.copy()
In [ ]:
converted_data_no_stop_words.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884604 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   verified      884604 non-null  int64 
 1   style         884604 non-null  object
 2   reviewerName  884604 non-null  object
 3   reviewText    884604 non-null  object
 4   summary       884604 non-null  object
 5   score         884604 non-null  Int64 
dtypes: Int64(1), int64(1), object(4)
memory usage: 48.1+ MB
In [ ]:
converted_data_no_stop_words.to_csv('no_stop_words_data.csv')
In [ ]:
# def a function to draw the Bar plot
#----------------plot_frequenct_words_bar-------------------
def plot_frequenct_words_bar(data, figsize = (15,10), name = 'style'):
  fig, ax = plt.subplots(figsize = figsize)
  data.plot.bar(ax = ax)
  ax.set_title("Most frequent 50 words' distribution of " + str(name))
  ax.set_ylabel('Counts')
In [ ]:
# def a function to get the report
#------------------frequent_words_reports---------------------
def frequent_words_reports(data, feature = 'style'):
  max_print_out(True)
  frequent_words = data[feature].str.split(expand=True).stack().value_counts().head(50) # get value accounts for words
  frequent_words = pd.DataFrame(frequent_words)
  plot_frequenct_words_bar(frequent_words)
  return frequent_words

Load our prepared no stop words datasets.

In [ ]:
converted_data_no_stop_words = pd.read_csv('/content/drive/MyDrive/A3/no_stop_words_data.csv')
converted_data_no_stop_words = converted_data_no_stop_words.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
converted_data_no_stop_words = converted_data_no_stop_words.convert_dtypes()
converted_data_no_stop_words = converted_data_no_stop_words.fillna('empty')
converted_data_no_stop_words.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884604 entries, 0 to 884603
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   verified      884604 non-null  Int64 
 1   style         884604 non-null  string
 2   reviewerName  884604 non-null  string
 3   reviewText    884604 non-null  string
 4   summary       884604 non-null  string
 5   score         884604 non-null  Int64 
dtypes: Int64(2), string(4)
memory usage: 42.2 MB

1.4.0.2 Word Report Function

1 Million instances are just too big.

We need to write our own data structure to get the results.

By using dictionaries.

If exits then add 1, if not then initialize with 1. Simple.

In [ ]:
converted_data_no_stop_words.columns
Out[ ]:
Index(['verified', 'style', 'reviewerName', 'reviewText', 'summary', 'score'], dtype='object')
In [ ]:
#--------------- get_words_reports-------------
from tqdm.notebook import tqdm
def get_words_dictionary(data, column_number = 3):
  # initialize dictionary
  words_dictionary = {}
  # loop all instances 
  for i in tqdm(range(len(data))):
    # get text from each instances
    text_array  = data.iloc[i, column_number]
    # get words
    for text in text_array.split():
      # if the word doesn't exits in dictionary then set number to 1
      if words_dictionary.get(text) == None:
        words_dictionary[text] = 1
      # if the word alreadt exits, then add number 1
      else:
        words_dictionary[text] = words_dictionary.get(text) + 1 
  return words_dictionary
In [ ]:
#-------------words_frequency_report--------------
def words_frequency_report(data, feature = 'reviewText',show_all=False,fig_size = (15,10)):
  # get column number
  column_number = data.columns.get_loc(feature)
  print("Start getting word reports by ", feature)
  words_dictionary = get_words_dictionary(data, column_number)
  print("### Finish get the words report")
  # get report 
  report = pd.DataFrame.from_dict(words_dictionary, orient='index')
  report.columns = ['counts']
  print("Load into Pandas dataFrame")
  # sort report
  report = report.sort_values(by=['counts'],  ascending=False)
  print("Sorting DataFrame")
  # decide if print all
  if show_all:
    report_head = report
  else:    
    # get first 50 columns
    report_head = report.head(50)

  print("Get report's top 50 words\nStart plotting")
  # plot setting
  fig, ax = plt.subplots(figsize = fig_size)
  report_head.plot.barh(ax = ax)

  if show_all:
      ax.set_title("Distribution of " + str(feature))
  else:  
    ax.set_title("Most frequent 50 words' distribution of " + str(feature))
  ax.set_xlabel('Counts')
  ax.set_ylabel('Words')
  plt.gca().invert_yaxis()
  return report

1.4.1 Distribution of the top 50 most frequent words

1. style

We do something unsual in style exclusively. We know that our type of book is style. And no matter how many words they have in one instance, they actually mean one type of reading materials.

Hence, in case we get something like Kindle or Edition as two words, we delete 'space' in each instances' style feature and combine all words as one.

In [ ]:
copy = converted_data_no_stop_words.copy()
In [ ]:
converted_data_no_stop_words = copy.copy()
In [ ]:
# we recover the words with capitalized format
converted_data_no_stop_words['style'] = converted_data_no_stop_words['style'].str.upper()
In [ ]:
style_frequent_words = words_frequency_report(converted_data_no_stop_words.copy(), 'style')
Start getting word reports by  style
### Finish get the words report
Load into Pandas dataFrame
Sorting DataFrame
Get report's top 50 words
Start plotting

We can see we actually get the result of "Edition" as the second top frequent used word. But Edition actually is part of "Kindle Edition" by the most frequent type of Format.

We will handle this later in question ii by remove the space in every format instance to make those words as one word.

In [ ]:
style_frequent_words.head(10)
Out[ ]:
counts
kindle 462031
edition 462031
paperback 252899
hardcover 145751
mass 75837
market 75837
book 9326
board 9173
cd 5455
audio 5212

2. Review Text

In [ ]:
reviewText_frequent_words = words_frequency_report(converted_data_no_stop_words.copy(), 'reviewText')
Start getting word reports by  reviewText
### Finish get the words report
Load into Pandas dataFrame
Sorting DataFrame
Get report's top 50 words
Start plotting

We can see that we still can enlarge our stop_words array.

Since the top frequent word is book, which is not really useful.

And the top 5 are book, read, one, books and great. Actually great is a good word.

We will deal this later.

In [ ]:
reviewText_frequent_words.head(10)
Out[ ]:
counts
stars 137083
book 99613
five 97653
great 83670
read 81271
good 63601
love 32858
story 31925
four 25027
series 24838

3. summary

In [ ]:
reviewText_frequent_words = words_frequency_report(converted_data_no_stop_words.copy(), 'summary')
Start getting word reports by  summary
### Finish get the words report
Load into Pandas dataFrame
Sorting DataFrame
Get report's top 50 words
Start plotting

This seems like much useful. Five, stars, good, great, love, those are all good words for high score.

4. reviewerName

In [ ]:
reviewText_frequent_words = words_frequency_report(converted_data_no_stop_words.copy(), 'reviewerName')
Start getting word reports by  reviewerName
### Finish get the words report
Load into Pandas dataFrame
Sorting DataFrame
Get report's top 50 words
Start plotting

1.4.2 Proportion of each format

We reuse our words report function but this time we draw all the words we get.

But first we remove the space in every instance. To make every format as one word.

In [ ]:
converted_data_no_stop_words['style'] = converted_data_no_stop_words['style'].str.replace(' ', '')
In [ ]:
style_format_report = words_frequency_report(converted_data_no_stop_words, 'style', show_all=True, fig_size = (15,15))
Start getting word reports by  style
### Finish get the words report
Load into Pandas dataFrame
Sorting DataFrame
Get report's top 50 words
Start plotting

Now, we calculate its proportion.

In [ ]:
# First we print all formats' counts
style_format_report
Out[ ]:
counts
KINDLEEDITION 461558
PAPERBACK 176902
HARDCOVER 145751
MASSMARKETPAPERBACK 75837
BOARDBOOK 9173
AUDIOCD 4528
AUDIBLEAUDIOBOOK 1484
SPIRALBOUND 1234
MP3CD 922
AUDIOCASSETTE 669
MAP 642
STAPLEBOUND 576
LIBRARYBINDING 543
KINDLEEDITIONAUDIOVIDEO 473
DVD 431
RINGBOUND 417
IMITATIONLEATHER 340
LEATHERBOUND 265
HARDCOVERSPIRAL 183
SCHOOLLIBRARYBINDING 169
PAMPHLET 162
PERFECTPAPERBACK 159
FLEXIBOUND 153
AMAZONVIDEO 142
VINYLBOUND 134
UNKNOWNBINDING 133
LOOSELEAF 133
BONDEDLEATHER 128
MISCSUPPLIES 125
CARDS 124
COLORCOLORINGBOOK 111
MISC 107
JOURNAL 106
STATIONERY 103
PLASTICCOMB 99
DIARY 94
ACCESSORY 80
RAGBOOK 42
BLURAY 41
CDROM 39
VHSTAPE 37
CALENDAR 34
TOY 32
ROUGHCUT 31
HEALTHBEAUTY 20
PRINTEDACCESSCODE 17
TURTLEBACK 16
MP3MUSIC 12
PACKAGEQUANTITY1 10
PRELOADEDDIGITALAUDIOPLAYER 10
UNBOUND 9
KITCHEN 8
COMIC 7
VINYL 7
TEXTBOOKBINDING 6
DVDROM 6
AUDIOCDLIBRARYBINDING 5
BABYPRODUCT 5
OFFICEPRODUCT 4
SHEETMUSIC 3
HDDVD 2
DVDR 2
POSTER 2
DIGITAL 2
GAME 1
PRINTDEMANDPAPERBACK 1
VIDEOGAME 1
PRINTDEMAND 1
ELECTRONICS 1
In [ ]:
style_format_proportion = style_format_report.div(style_format_report.sum()) # divide by the summation of all format

Now, draw the proportion graph.

In [ ]:
fig,ax = plt.subplots(figsize=(10,15))
style_format_proportion.plot.barh(ax=ax)
ax.set_title("Proportion of each format", fontsize=20)
ax.set_ylabel('Proportion')
plt.xticks(fontsize=16);
plt.gca().invert_yaxis()

We can see there are almost 50% format is Kindle Edition.

Proportion data frame:

In [ ]:
max_print_out(True)
style_format_proportion
Out[ ]:
counts
KINDLEEDITION 0.52
PAPERBACK 0.20
HARDCOVER 0.16
MASSMARKETPAPERBACK 0.09
BOARDBOOK 0.01
AUDIOCD 0.01
AUDIBLEAUDIOBOOK 0.00
SPIRALBOUND 0.00
MP3CD 0.00
AUDIOCASSETTE 0.00
MAP 0.00
STAPLEBOUND 0.00
LIBRARYBINDING 0.00
KINDLEEDITIONAUDIOVIDEO 0.00
DVD 0.00
RINGBOUND 0.00
IMITATIONLEATHER 0.00
LEATHERBOUND 0.00
HARDCOVERSPIRAL 0.00
SCHOOLLIBRARYBINDING 0.00
PAMPHLET 0.00
PERFECTPAPERBACK 0.00
FLEXIBOUND 0.00
AMAZONVIDEO 0.00
VINYLBOUND 0.00
UNKNOWNBINDING 0.00
LOOSELEAF 0.00
BONDEDLEATHER 0.00
MISCSUPPLIES 0.00
CARDS 0.00
COLORCOLORINGBOOK 0.00
MISC 0.00
JOURNAL 0.00
STATIONERY 0.00
PLASTICCOMB 0.00
DIARY 0.00
ACCESSORY 0.00
RAGBOOK 0.00
BLURAY 0.00
CDROM 0.00
VHSTAPE 0.00
CALENDAR 0.00
TOY 0.00
ROUGHCUT 0.00
HEALTHBEAUTY 0.00
PRINTEDACCESSCODE 0.00
TURTLEBACK 0.00
MP3MUSIC 0.00
PACKAGEQUANTITY1 0.00
PRELOADEDDIGITALAUDIOPLAYER 0.00
UNBOUND 0.00
KITCHEN 0.00
COMIC 0.00
VINYL 0.00
TEXTBOOKBINDING 0.00
DVDROM 0.00
AUDIOCDLIBRARYBINDING 0.00
BABYPRODUCT 0.00
OFFICEPRODUCT 0.00
SHEETMUSIC 0.00
HDDVD 0.00
DVDR 0.00
POSTER 0.00
DIGITAL 0.00
GAME 0.00
PRINTDEMANDPAPERBACK 0.00
VIDEOGAME 0.00
PRINTDEMAND 0.00
ELECTRONICS 0.00

Except the top a few formats, the other format become 0% proportion.

By the counts table, we can see some of them are only 1 or 2 instances in the dataset, they are really rare.

We print the proportion of the not very frequent use format.

In [ ]:
fig,ax = plt.subplots(figsize=(10,15))
style_format_proportion.iloc[6:, :].plot.barh(ax=ax)
ax.set_title("Proportion of each format", fontsize=20)
ax.set_ylabel('Proportion')
plt.xticks(fontsize=16);
plt.gca().invert_yaxis()

1.4.3 Most/Least common format

First we replot the proportion graph.

In [ ]:
fig,ax = plt.subplots(figsize=(10,15))
style_format_proportion.plot.barh(ax=ax)
ax.set_title("Proportion of each format", fontsize=20)
ax.set_ylabel('Proportion')
plt.xticks(fontsize=16);
plt.gca().invert_yaxis()

We can see The Most common format of the books is Kindle Edition.

In [ ]:
style_format_proportion.head(1).index
Out[ ]:
array(['KINDLEEDITION'], dtype=object)

We can see The Least common format of the books is Electronics.

Acutaully, that is not really a book format, we can say print demand is the least common book format.

In [ ]:
style_format_proportion.tail(1).index
Out[ ]:
Index(['ELECTRONICS'], dtype='object')

1.4.4 Data Patterns

1. Most common 6 types of format has same distribution in scores.

In [ ]:
fig,ax = plt.subplots(figsize=(10,8))
style_format_proportion.iloc[:6,:].plot.bar(ax=ax)
ax.set_title("Proportion of each format", fontsize=20)
ax.set_ylabel('Proportion')
plt.xticks(fontsize=16);

We can see that Kindle Edition are the best seller.

Then following up with PaperBack and Hardcover. People buy paperback since paperback format is cheaper. And it's interesting to see that almost same proportion of hardcover book are sold as well.

And the next one is Mass Paperback book.

What Is a Mass Market Paperback? A mass market paperback book (MMPB), or simply mass paperback, is a mass-produced book that is typically small with thin paper covers and relatively low-quality pages to keep printing costs down. Bestsellers are often printed as mass market paperbook books for wide distribution Masterclass

Small books are also sold well.

Those four format of book are contributed almost all reviews.

We would like to see the overall score in each format book.

In [ ]:
style_format_proportion.head(6)
Out[ ]:
counts
KINDLEEDITION 0.52
PAPERBACK 0.20
HARDCOVER 0.16
MASSMARKETPAPERBACK 0.09
BOARDBOOK 0.01
AUDIOCD 0.01
In [ ]:
def draw_plot(results):
  fig, ax = plt.subplots(1,2,figsize=(15,8))
  sns.boxplot(data=results, ax = ax[0])
  ax[0].set_ylabel('Scores')
  ax[0].set_title('Boxplot of scores')
  sns.histplot(data=results, ax = ax[1])
  ax[1].set_title('Historgram of scores')
1. Kindle format over all score
In [ ]:
kindle_book_review = converted_data_no_stop_words[converted_data_no_stop_words['style'] == 'KINDLEEDITION']
In [ ]:
kindle_book_review.head()
Out[ ]:
verified style reviewerName reviewText summary score
0 0 KINDLEEDITION rub chicken book starts really interesting ideas gets cree... really interesting ideas gets creepy boring go 2
2 1 KINDLEEDITION reader pacific read number murakami novels seem follow hollow... hollow man 4
3 0 KINDLEEDITION j williams great story highly recommended five stars 5
5 1 KINDLEEDITION thomas elwood gave son lawyer worked washington dcwe loved i... must reading lawyers practicing big cities cli... 5
7 0 KINDLEEDITION patti great read unexpected turns 60 plus relate ins... unexpected 4
In [ ]:
print("People who read kindle give average", round(kindle_book_review['score'].mean(),2), " scores")
People who read kindle give average 4.35  scores
In [ ]:
kindle_score = pd.DataFrame(kindle_book_review['score'])
kindle_score.columns = ['Kindle']
draw_plot(kindle_score)
2. Paperback format over all score
In [ ]:
paperback_review = converted_data_no_stop_words[converted_data_no_stop_words['style'] == 'PAPERBACK']
In [ ]:
print("People who read paperback give average", round(paperback_review['score'].mean(),2), " scores")
People who read paperback give average 4.4  scores
In [ ]:
paperback_score = pd.DataFrame(paperback_review['score'])
paperback_score.columns = ['Paperback']
draw_plot(paperback_score)
3. Hardback format over all score
In [ ]:
HARDBACK_review = converted_data_no_stop_words[converted_data_no_stop_words['style'] == 'HARDCOVER']
print("People who read paperback give average", round(HARDBACK_review['score'].mean(),2), " scores")
People who read paperback give average 4.32  scores
In [ ]:
HARDBACK_score = pd.DataFrame(HARDBACK_review['score'])
HARDBACK_score.columns = ['HardCover']
draw_plot(HARDBACK_score)
In [ ]:
HARDBACK_score.mean()
Out[ ]:
HardCover   4.32
dtype: float64
4. Mass Market Paperback
In [ ]:
MASSMARKETPAPERBACK_review = converted_data_no_stop_words[converted_data_no_stop_words['style'] == 'MASSMARKETPAPERBACK']
print("People who read paperback give average", round(MASSMARKETPAPERBACK_review['score'].mean(),2), " scores")
People who read paperback give average 4.26  scores
In [ ]:
MASS_paperback_score = pd.DataFrame(MASSMARKETPAPERBACK_review['score'])
MASS_paperback_score.columns = ['MASS_paperback']
draw_plot(MASS_paperback_score)
In [ ]:
MASS_paperback_score.mean()
Out[ ]:
Paperback   4.26
dtype: float64
5. Board Book
In [ ]:
boardbook_review = converted_data_no_stop_words[converted_data_no_stop_words['style'] == 'BOARDBOOK']
print("People who read paperback give average", round(boardbook_review['score'].mean(),2), " scores")
boardbook_score = pd.DataFrame(boardbook_review['score'])
boardbook_score.columns = ['BOARDBOOK']
draw_plot(boardbook_score)
People who read paperback give average 4.64  scores
6.AudioCD
In [ ]:
AudioCD_review = converted_data_no_stop_words[converted_data_no_stop_words['style'] == 'AUDIOCD']
print("People who read paperback give average", round(AudioCD_review['score'].mean(),2), " scores")
AudioCD_score = pd.DataFrame(AudioCD_review['score'])
AudioCD_score.columns = ['BOARDBOOK']
draw_plot(AudioCD_score)
People who read paperback give average 4.39  scores

We can see that except the board Book, other format's reader give the same distribution of scores.

That means the book format probably not a very important feature.

Even Board Book has the same hitorgram look with others format.

2. Task 2: Text normalization and feature engineering (0.1)

  1. Create a new column merging review summary and text.
  2. Remove stop words.
  3. Remove numbers and other non-letter characters.
  4. Perform either lemmatization or stemming. Motivate your choice.
  5. Convert the corpus into a bag-of-words TF-IDF weighted vector representation.

2.1 Create a new column merging review summary and text.

First we merge two features' data.

In [ ]:
text_data = converted_data_no_stop_words.copy()
In [ ]:
text_data.head()
Out[ ]:
verified style reviewerName reviewText summary score
0 0 kindle edition rub chicken book starts really interesting ideas gets cree... really interesting ideas gets creepy boring go 2
1 1 hardcover bluegrassanne gift someone loved loved 5
2 1 kindle edition reader pacific read number murakami novels seem follow hollow... hollow man 4
3 0 kindle edition j williams great story highly recommended five stars 5
4 1 hardcover noraxpat passed book someone interested finance suze or... ok intro personal finance 3
In [ ]:
text_data['reviewText'].head(1)
Out[ ]:
0    book starts really interesting ideas gets cree...
Name: reviewText, dtype: string
In [ ]:
text_data['summary'].head(1)
Out[ ]:
0    really interesting ideas gets creepy boring go
Name: summary, dtype: string
In [ ]:
text_data['text'] = text_data['summary'] + " " + text_data['reviewText']
In [ ]:
text_data.head()
Out[ ]:
verified style reviewerName reviewText summary score text
0 0 kindle edition rub chicken book starts really interesting ideas gets cree... really interesting ideas gets creepy boring go 2 really interesting ideas gets creepy boring go...
1 1 hardcover bluegrassanne gift someone loved loved 5 loved gift someone loved
2 1 kindle edition reader pacific read number murakami novels seem follow hollow... hollow man 4 hollow man read number murakami novels seem fo...
3 0 kindle edition j williams great story highly recommended five stars 5 five stars great story highly recommended
4 1 hardcover noraxpat passed book someone interested finance suze or... ok intro personal finance 3 ok intro personal finance passed book someone ...
In [ ]:
max_print_out(True)
text_data.head(3).text.values
Out[ ]:
<StringArray>
[                           'really interesting ideas gets creepy boring go book starts really interesting ideas gets creepy boring go',
                                                                                                             'loved gift someone loved',
 'hollow man read number murakami novels seem follow hollow man idea chief characters struggles connect feel abandoned problems abound']
Length: 3, dtype: string

Done.

2.2. Remove stop words

We already removed stop words in previous section.

Hence, we copy the code to here.

In [ ]:
#--------------remove_stop_words-------------
def remove_stop_words(data, stop_words):
  feature = data.select_dtypes(exclude="number").columns
  for i in range(len(feature)):
      print("Now it's removing stop words from ", feature[i])
      # remove stop words
      # first change all character to lower case
      data[feature[i]] = data[feature[i]].str.lower()
      data[feature[i]] = data[feature[i]].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
  return data 

Our data set is already free of stop words.

In [ ]:
from nltk.corpus import stopwords  
stop_words = stopwords.words('english')
text_data = remove_stop_words(text_data, stop_words)
In [ ]:
text_data.head()
Out[ ]:
verified style reviewerName reviewText summary score text
0 0 KINDLEEDITION rub chicken book starts really interesting ideas gets cree... really interesting ideas gets creepy boring go 2 really interesting ideas gets creepy boring go...
1 1 HARDCOVER bluegrassanne gift someone loved loved 5 loved gift someone loved
2 1 KINDLEEDITION reader pacific read number murakami novels seem follow hollow... hollow man 4 hollow man read number murakami novels seem fo...
3 0 KINDLEEDITION j williams great story highly recommended five stars 5 five stars great story highly recommended
4 1 HARDCOVER noraxpat passed book someone interested finance suze or... ok intro personal finance 3 ok intro personal finance passed book someone ...

2.3. Remove numbers and other non-letter characters

The following function is for removing non letter characters

In [ ]:
#--------------remove_num_non_letters-------------
def remove_num_non_letters(data):
  feature = data.select_dtypes(exclude="number").columns
  for i in range(len(feature)):
      print("Now it's removing num_non_letters from ", feature[i])
      # remove stop words
      # remove num_non_letters
      data[feature[i]] = data[feature[i]].str.replace('[^\w\s]+', '')
      data[feature[i]] = data[feature[i]].str.replace('[0-9]+', '')
  return data     
In [ ]:
text_data = remove_num_non_letters(text_data)
Now it's removing num_non_letters from  style
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:8: FutureWarning: The default value of regex will change from True to False in a future version.
  
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:9: FutureWarning: The default value of regex will change from True to False in a future version.
  if __name__ == '__main__':
Now it's removing num_non_letters from  reviewerName
Now it's removing num_non_letters from  reviewText
Now it's removing num_non_letters from  summary
Now it's removing num_non_letters from  text
In [ ]:
text_data.head()
Out[ ]:
verified style reviewerName reviewText summary score text
0 0 KINDLEEDITION rub chicken book starts really interesting ideas gets cree... really interesting ideas gets creepy boring go 2 really interesting ideas gets creepy boring go...
1 1 HARDCOVER bluegrassanne gift someone loved loved 5 loved gift someone loved
2 1 KINDLEEDITION reader pacific read number murakami novels seem follow hollow... hollow man 4 hollow man read number murakami novels seem fo...
3 0 KINDLEEDITION j williams great story highly recommended five stars 5 five stars great story highly recommended
4 1 HARDCOVER noraxpat passed book someone interested finance suze or... ok intro personal finance 3 ok intro personal finance passed book someone ...

2.4 Perform either lemmatization or stemming. Motivate your choice.

Here we choose Lemmatization, since it will remain the words' meaning.

Instead of cut into some small pieces of word but make no sence.

In [ ]:
import nltk
from nltk.stem import PorterStemmer


w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

ps = PorterStemmer()

def lemmatize_text(text):
  return [lemmatizer.lemmatize(w,'v') for w in w_tokenizer.tokenize(text)]

def lemmatize_dataset(data):
  tt = pd.DataFrame(data['text'])
  data['token_text'] = tt.text.apply(lemmatize_text)
  
  return data
In [ ]:
lemmatized_text_data = lemmatize_dataset(text_data)
In [ ]:
lemmatized_text_data.to_csv('lemmatized_text_data.csv')
In [ ]:
# Here we delete useless features
lemmatized_text_data = lemmatized_text_data.drop(['reviewText','summary','text','reviewerName'],axis=1)
In [ ]:
lemmatized_text_data['score'] = lemmatized_text_data.pop('score')
In [ ]:
lemmatized_text_data.head()
Out[ ]:
verified style token_text score
0 0 kindle edition ['really', 'interest', 'ideas', 'get', 'creepy... 2
1 1 hardcover ['love', 'gift', 'someone', 'love'] 5
2 1 kindle edition ['hollow', 'man', 'read', 'number', 'murakami'... 4
3 0 kindle edition ['five', 'star', 'great', 'story', 'highly', '... 5
4 1 hardcover ['ok', 'intro', 'personal', 'finance', 'pass',... 3

2.5 Convert the corpus into a bag-of-words TF-IDF weighted vector representation

Before we do anything, we first load our converted data from the previously stored csv file.

In [ ]:
lemmatized_text_data = pd.read_csv('/content/drive/MyDrive/A3/lemmatized_text_data.csv')
lemmatized_text_data = lemmatized_text_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
lemmatized_text_data = lemmatized_text_data.convert_dtypes()
lemmatized_text_data = lemmatized_text_data.fillna('empty')
lemmatized_text_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884604 entries, 0 to 884603
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   verified      884604 non-null  Int64 
 1   style         884604 non-null  string
 2   reviewerName  884604 non-null  string
 3   reviewText    884604 non-null  string
 4   summary       884604 non-null  string
 5   score         884604 non-null  Int64 
 6   text          884604 non-null  string
 7   token_text    884604 non-null  string
dtypes: Int64(2), string(6)
memory usage: 55.7 MB
In [ ]:
lemmatized_text_data.head()
Out[ ]:
verified style reviewerName reviewText summary score text token_text
0 0 kindle edition rub chicken book starts really interesting ideas gets cree... really interesting ideas gets creepy boring go 2 really interesting ideas gets creepy boring go... ['really', 'interest', 'ideas', 'get', 'creepy...
1 1 hardcover bluegrassanne gift someone loved loved 5 loved gift someone loved ['love', 'gift', 'someone', 'love']
2 1 kindle edition reader pacific read number murakami novels seem follow hollow... hollow man 4 hollow man read number murakami novels seem fo... ['hollow', 'man', 'read', 'number', 'murakami'...
3 0 kindle edition j williams great story highly recommended five stars 5 five stars great story highly recommended ['five', 'star', 'great', 'story', 'highly', '...
4 1 hardcover noraxpat passed book someone interested finance suze or... ok intro personal finance 3 ok intro personal finance passed book someone ... ['ok', 'intro', 'personal', 'finance', 'pass',...

No we use TDIDF on text features one by one. Since, I didn't find a solution to do all of them at once.

First we do TDIDF on our token_text that are already lemmatized.

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# we set for max_features: as most words we saved for 500 words
# in order to not exceed our memory
v = TfidfVectorizer(stop_words='english', max_features = 500)

# get TDIDF array from token_text
x_token_text = v.fit_transform(lemmatized_text_data['token_text'])

Now we have our token TDIDF set with 500 columns.

In [ ]:
x_token_text.shape
Out[ ]:
(884604, 500)
In [ ]:
# save it into a pandas dataframe
tdidf_data = pd.DataFrame(x_token_text.toarray())
In [ ]:
tdidf_data.head()
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 490 491 492 493 494 495 496 497 498 499
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows Ă— 500 columns

In [ ]:
tdidf_data.describe()
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 490 491 492 493 494 495 496 497 498 499
count 884604.000000 884604.000000 884604.000000 884604.000000 884604.000000 884604.000000 884604.000000 884604.000000 884604.000000 884604.000000 ... 884604.000000 884604.000000 884604.000000 884604.000000 884604.00000 884604.000000 884604.000000 884604.000000 884604.000000 884604.000000
mean 0.004150 0.004611 0.002784 0.007143 0.004623 0.004318 0.002663 0.002208 0.005381 0.004411 ... 0.030626 0.006202 0.002362 0.003373 0.00644 0.009179 0.002618 0.003307 0.005651 0.003538
std 0.034150 0.040591 0.032922 0.047640 0.034745 0.035308 0.031381 0.027655 0.044634 0.037704 ... 0.082974 0.046754 0.029465 0.033506 0.04557 0.047719 0.029923 0.032198 0.040424 0.032685
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
max 0.943918 1.000000 1.000000 1.000000 0.876551 1.000000 1.000000 0.909867 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 0.963821 0.924894 1.000000

8 rows Ă— 500 columns

Now we get the TDIDF array from style.

And we know that from previous section, there is only 6 types of style are mainly distribution in our dataset. Hence, we declare a new vectorizer and set our max_features = 8

In [ ]:
# now we get TDIDF array from style

v = TfidfVectorizer(stop_words='english', max_features = 8)
# get TDIDF array from style

x_style = v.fit_transform(lemmatized_text_data['style'])
In [ ]:
x_style.shape
Out[ ]:
(884604, 8)
In [ ]:
x_style.toarray()
Out[ ]:
array([[0.        , 0.        , 0.70710678, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.70710678, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

Now we combine then together.

In [ ]:
tdidf_data = pd.concat([tdidf_data, pd.DataFrame(x_style.toarray())], axis = 1)
In [ ]:
tdidf_data.head()
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 498 499 0 1 2 3 4 5 6 7
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.000000 1.0 0.000000 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.000000 1.0 0.000000 0.0 0.0 0.0

5 rows Ă— 508 columns

Now we put the verified column and our target value back to this dataset.

In [ ]:
tdidf_data['verified'] = lemmatized_text_data['verified']
tdidf_data['score'] = lemmatized_text_data['score']
In [ ]:
tdidf_data.head()
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 0 1 2 3 4 5 6 7 verified score
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0 0 2
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 1.0 0.000000 0.0 0.0 0.0 1 5
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0 1 4
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0 0 5
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 1.0 0.000000 0.0 0.0 0.0 1 3

5 rows Ă— 510 columns

Note, we detached the rest of features like reviewName, which isn't that important. And the review text and summary, since they both are already in the token_text. We don't need those anymore.

Since, this result is too big, we won't save it into file again. Not write a function to generate this result.

In [ ]:
#----------------TDIDF_Data_generator---------------
def TDIDF_Data_generator(data, max_features = 500):
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.feature_extraction.text import TfidfTransformer
  # we set for max_features: as most words we saved for 500 words
  # in order to not exceed our memory
  v_test = TfidfVectorizer(stop_words='english', max_features = max_features)
  v_style = TfidfVectorizer(stop_words='english', max_features = 8)
  # get TDIDF array from token_text
  x_token_text = v_test.fit_transform(data['token_text'])
  # save it into a pandas dataframe
  tdidf_data = pd.DataFrame(x_token_text.toarray(), columns = v_test.get_feature_names() )
  # get TDIDF array from style
  x_style = v_style.fit_transform(data['style'])
  tdidf_data = pd.concat([tdidf_data, pd.DataFrame(x_style.toarray(), columns = v_style.get_feature_names() )], axis = 1)
  data_copy = data.copy()
  data_copy = data_copy.reset_index() # need to reset index for matching the results
  tdidf_data['verified'] = data_copy['verified']
  tdidf_data['score'] = data_copy['score']
  return tdidf_data

We will use this function in the next chapter. Since there is no need to show the results twice.

Task 3: Build a model to predict overall score (0.3)

3.1. Use score as the target variable. Explain what is the task you’re solving (e.g.,

supervised x unsupervised, classification x regression x clustering or similarity matching x etc).

We are solving a superviesed multi-labeled classification task.

  1. We have score as our target label
  2. Our traget label, score, only has 5 values. We need to classify all instances into those 5 values. Hence, it is a multi-labeld classification problem.

3.2. Use a feature selection method to select the features to build a model.

In [ ]:
lemmatized_text_data = pd.read_csv('/content/drive/MyDrive/A3/lemmatized_text_data.csv')
lemmatized_text_data = lemmatized_text_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
lemmatized_text_data = lemmatized_text_data.convert_dtypes()
lemmatized_text_data = lemmatized_text_data.fillna('empty')
lemmatized_text_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884604 entries, 0 to 884603
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   verified      884604 non-null  Int64 
 1   style         884604 non-null  string
 2   reviewerName  884604 non-null  string
 3   reviewText    884604 non-null  string
 4   summary       884604 non-null  string
 5   score         884604 non-null  Int64 
 6   text          884604 non-null  string
 7   token_text    884604 non-null  string
dtypes: Int64(2), string(6)
memory usage: 55.7 MB

Now, we split the dataset first for more ability to find good features.

we take only 10% of data to do the feature selection. As our 100K dataset.

In [ ]:
from sklearn.model_selection import train_test_split
model_data_raw, _ = train_test_split(lemmatized_text_data, test_size=0.88695, random_state=42)
# we drop reviewText and reivewName, summary and text columns, since we already have token_text and review Name is useless.
model_data_raw = model_data_raw.drop(['reviewerName', 'reviewText', 'summary', 'text'], axis = 1)
# rearange the target label feature score to the end
model_data_raw['score'] = model_data_raw.pop('score')
In [ ]:
model_data_raw.head()
Out[ ]:
verified style token_text score
744031 1 kindle edition ['sigh', 'miss', 'claire', 'jamie', 'vague', '... 1
801184 1 kindle edition ['remarkable', 'book', 'many', 'level', 'thoro... 5
341256 0 paperback ['buy', 'book', 'laurel', 'hardy', 'fan', 'lov... 5
734969 0 hardcover ['enthral', 'suspense', 'readers', 'familiar',... 4
750145 0 hardcover ['great', 'story', 'fantastic', 'illustrations... 5
In [ ]:
model_data_raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100004 entries, 744031 to 121958
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   verified    100004 non-null  Int64 
 1   style       100004 non-null  string
 2   token_text  100004 non-null  string
 3   score       100004 non-null  Int64 
dtypes: Int64(2), string(2)
memory usage: 4.0 MB

That's the final version we need.

In [ ]:
model_data = TDIDF_Data_generator(model_data_raw)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
In [ ]:
model_data = model_data.fillna(0) # in case of losing values during the period of generating TDIDF array
In [ ]:
model_data.head()
Out[ ]:
able absolutely account action actually add addition adult adventure age ... board book edition hardcover kindle market mass paperback verified score
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0 1 1
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0 1 5
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 1.0 0 5
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 1.0 0.000000 0.0 0.0 0.0 0 4
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 1.0 0.000000 0.0 0.0 0.0 0 5

5 rows Ă— 510 columns

Split our model_data into train and test set.

In [ ]:
from sklearn.model_selection import train_test_split
training_data, test_data = train_test_split(model_data, test_size=0.3, random_state=42)
In [ ]:
# get our data set into features and labels
X_train = training_data.iloc[:,:-1]
y_train = training_data.iloc[:,-1:].values.ravel()
y_train = y_train.astype(int)
X_test = test_data.iloc[:,:-1]

First we save the feature name.

In [ ]:
features_name = X_train.loc[:,:].columns.tolist();
len(features_name)
Out[ ]:
509

Reuse our feature selection function from Assignment 2.

In [ ]:
from sklearn.feature_selection import SelectKBest

# feature selection
def select_features_prompt(X_train, y_train, X_test,function):
    # configure to select all features
    fs = SelectKBest(score_func=function, k='all')
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    # what are scores for the features

    # print features' name and score
    for i in range(len(fs.scores_)):
        print(f'Feature {i}  {features_name[i]}: { fs.scores_[i]}' )
    return fs.scores_

Get all features importances as following:

In [ ]:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

fscores = select_features_prompt(X_train, y_train,X_test, f_classif)
Feature 0  able: 1.021211817925015
Feature 1  absolutely: 40.960661262266825
Feature 2  account: 1.9686474229788244
Feature 3  action: 9.882491965111663
Feature 4  actually: 16.370282632085125
Feature 5  add: 0.3697776193695684
Feature 6  addition: 5.313261749502565
Feature 7  adult: 1.8548117384507112
Feature 8  adventure: 16.041688726739654
Feature 9  age: 3.51662690826065
Feature 10  ago: 2.9454857131359007
Feature 11  agree: 22.77748538386051
Feature 12  allow: 1.0113275724829718
Feature 13  amaze: 108.83079210017185
Feature 14  american: 1.7629237669136886
Feature 15  answer: 1.8660794847072606
Feature 16  appreciate: 4.3543828066308174
Feature 17  art: 1.005418037193231
Feature 18  ask: 1.759263052499672
Feature 19  attention: 2.5160887062643873
Feature 20  author: 33.47690088150282
Feature 21  away: 7.677713727912668
Feature 22  awesome: 89.80585345376421
Feature 23  baby: 0.9120079145840939
Feature 24  background: 6.306148194352899
Feature 25  bad: 217.40073448456047
Feature 26  base: 4.76837651372978
Feature 27  beautiful: 40.27367824252181
Feature 28  beautifully: 21.379716897175083
Feature 29  begin: 2.5158954888253384
Feature 30  believable: 8.08663276684383
Feature 31  believe: 24.48178143865838
Feature 32  best: 44.2534237206105
Feature 33  better: 75.59812655058425
Feature 34  big: 17.17251460313486
Feature 35  bite: 244.9390610126215
Feature 36  black: 0.7067708603564745
Feature 37  book: 83.5363780927885
Feature 38  bore: 879.5349065371071
Feature 39  boy: 1.9694911215742208
Feature 40  break: 4.041526640798681
Feature 41  brilliant: 12.037681551970325
Feature 42  bring: 8.663104255198016
Feature 43  brother: 2.781285393545565
Feature 44  brown: 3.56807728849369
Feature 45  build: 5.4465808603136905
Feature 46  business: 1.0244897707415062
Feature 47  buy: 48.47999280711704
Feature 48  care: 108.5408294880871
Feature 49  case: 17.29969155001157
Feature 50  cat: 2.8462686543987017
Feature 51  catch: 5.1827282336974525
Feature 52  certainly: 20.577699341823667
Feature 53  challenge: 2.2885689316798645
Feature 54  change: 2.2902446556318865
Feature 55  chapter: 21.337127941145443
Feature 56  chapters: 79.05378359813953
Feature 57  character: 96.76488567663492
Feature 58  charm: 1.8673659135612444
Feature 59  check: 7.246482266260677
Feature 60  child: 2.8728670262403955
Feature 61  children: 6.826310925871946
Feature 62  choose: 2.9481606022843567
Feature 63  christmas: 8.46748792989765
Feature 64  class: 0.6879160683819383
Feature 65  classic: 7.197529345373304
Feature 66  clear: 0.30510177189864834
Feature 67  close: 4.620234948040467
Feature 68  collection: 2.497995816379268
Feature 69  color: 8.018843578105193
Feature 70  come: 1.360440306710302
Feature 71  compel: 1.582488483664007
Feature 72  complete: 3.729685855811075
Feature 73  completely: 26.59049030371993
Feature 74  complex: 3.5819243641633656
Feature 75  condition: 10.383091539031419
Feature 76  confuse: 83.16762600494927
Feature 77  consider: 11.197753295820648
Feature 78  continue: 2.801674459632921
Feature 79  copy: 6.621508255383115
Feature 80  country: 0.5377051913789135
Feature 81  couple: 11.317047924995935
Feature 82  course: 2.8231867308355794
Feature 83  cover: 2.6913013156955636
Feature 84  create: 0.5447806779815108
Feature 85  culture: 3.8700798666749012
Feature 86  cute: 4.21590601827296
Feature 87  dark: 8.280765925283044
Feature 88  date: 9.094460223804699
Feature 89  daughter: 8.431900848325405
Feature 90  day: 9.341102826814945
Feature 91  days: 0.6845916834475211
Feature 92  dead: 4.435282533635656
Feature 93  deal: 1.2777886129413834
Feature 94  death: 0.7444245560263609
Feature 95  decide: 3.8416308864185194
Feature 96  deep: 0.2684572873483984
Feature 97  definitely: 11.631610606450943
Feature 98  deliver: 3.9975697924300286
Feature 99  depth: 21.61640247535476
Feature 100  descriptions: 13.055320789244288
Feature 101  develop: 17.209390688221212
Feature 102  development: 14.88787961262062
Feature 103  didnt: 381.50884373638604
Feature 104  die: 3.0044458655879165
Feature 105  different: 23.298939180744178
Feature 106  difficult: 25.71901376286913
Feature 107  disappoint: 269.33955414972957
Feature 108  discover: 5.879208122311144
Feature 109  doesnt: 52.97840724519207
Feature 110  dog: 0.3016873071691755
Feature 111  dont: 224.33378220668743
Feature 112  draw: 3.80369980667105
Feature 113  dream: 1.9167550835128595
Feature 114  earlier: 32.938965448425776
Feature 115  early: 10.765507658430487
Feature 116  easily: 1.1591455077574322
Feature 117  easy: 22.966922700818706
Feature 118  end: 83.43988849437211
Feature 119  engage: 7.577468257886735
Feature 120  enjoy: 115.75397201266655
Feature 121  enjoyable: 87.60409146519193
Feature 122  entertain: 42.349756720546765
Feature 123  entire: 2.9142211444134745
Feature 124  especially: 9.471450721959169
Feature 125  events: 7.410810270866533
Feature 126  exactly: 1.9647893299619354
Feature 127  excellent: 149.95972572168898
Feature 128  excite: 0.9054292025198968
Feature 129  expect: 85.14663025077458
Feature 130  experience: 2.948449578321241
Feature 131  explain: 2.7063034818501928
Feature 132  extremely: 9.702932704671339
Feature 133  eye: 1.691801071051601
Feature 134  face: 0.9471860417101372
Feature 135  fact: 16.18649309494391
Feature 136  fall: 6.562760713838693
Feature 137  family: 17.583694798854314
Feature 138  fan: 7.717437742690806
Feature 139  fantastic: 54.97384317232719
Feature 140  fantasy: 1.223346055640355
Feature 141  far: 8.287966680908088
Feature 142  fascinate: 16.566184031256732
Feature 143  fast: 14.73502136680402
Feature 144  father: 1.3947059737262757
Feature 145  favorite: 35.034290722302046
Feature 146  feel: 31.954986487478134
Feature 147  felt: 123.27837366011437
Feature 148  fiction: 2.206709796646705
Feature 149  fight: 2.1746825867357735
Feature 150  figure: 13.994875134633226
Feature 151  finally: 5.247259691428375
Feature 152  fine: 4.2633569114793355
Feature 153  finish: 154.8245969821002
Feature 154  focus: 30.096504891179716
Feature 155  follow: 13.707363501401192
Feature 156  food: 3.603898484819537
Feature 157  force: 19.788002158991784
Feature 158  forget: 5.232058167727248
Feature 159  form: 6.807185835962592
Feature 160  forward: 15.982841288371318
Feature 161  free: 10.659850683846404
Feature 162  friend: 1.482567862989908
Feature 163  friends: 12.670416919183694
Feature 164  fun: 51.65496259907225
Feature 165  funny: 1.1675276210649228
Feature 166  future: 2.1220250304231407
Feature 167  game: 1.3622335850622849
Feature 168  genre: 4.327477908479522
Feature 169  gift: 16.97220792875733
Feature 170  girl: 14.051566053865976
Feature 171  glad: 3.3842552677779416
Feature 172  god: 8.137776066778999
Feature 173  good: 360.7010738683538
Feature 174  great: 550.2035544259879
Feature 175  group: 2.406055515365452
Feature 176  grow: 6.563843716961437
Feature 177  guess: 12.057752888127789
Feature 178  guide: 0.91521260669867
Feature 179  guy: 20.560741793581112
Feature 180  half: 89.24060239960714
Feature 181  hand: 0.2601759187201671
Feature 182  happen: 17.69560608390351
Feature 183  happy: 9.421638509541502
Feature 184  hard: 29.739228943058002
Feature 185  havent: 0.49644925462028416
Feature 186  head: 6.1532052393419425
Feature 187  hear: 0.19331497596326036
Feature 188  heart: 15.798197508258395
Feature 189  help: 7.2530310424278275
Feature 190  helpful: 3.536396235968141
Feature 191  hero: 13.636572206215307
Feature 192  heroine: 40.24507444127414
Feature 193  hes: 6.719765811662437
Feature 194  high: 20.24098693594341
Feature 195  highly: 118.24213348374654
Feature 196  historical: 11.072696721513022
Feature 197  history: 6.711777940705834
Feature 198  hit: 4.847656941761069
Feature 199  hold: 6.81741298688267
Feature 200  home: 6.389599966757839
Feature 201  honest: 3.4496320899066664
Feature 202  hook: 14.344908125192934
Feature 203  hop: 133.9815227027199
Feature 204  hope: 8.08114921054935
Feature 205  hot: 6.455905594291686
Feature 206  house: 7.782484655357903
Feature 207  huge: 21.677039168844363
Feature 208  human: 6.683757073595557
Feature 209  humor: 1.8827552460368748
Feature 210  husband: 0.5786161430729829
Feature 211  id: 25.08875578757056
Feature 212  idea: 49.50184851667582
Feature 213  ideas: 4.778688016729829
Feature 214  ill: 19.993151508912888
Feature 215  illustrations: 7.83369423424127
Feature 216  im: 25.200292963205857
Feature 217  important: 2.0357053266507967
Feature 218  include: 3.194244862310044
Feature 219  information: 7.091349182252645
Feature 220  informative: 8.06009406412211
Feature 221  insight: 4.192969396106819
Feature 222  installment: 2.5659313922801843
Feature 223  instead: 126.82522457921483
Feature 224  intrigue: 7.537719524537181
Feature 225  introduce: 1.5033173054829314
Feature 226  involve: 5.670288705066902
Feature 227  isnt: 46.4601862322194
Feature 228  issue: 13.332155996533896
Feature 229  ive: 7.857117553029478
Feature 230  jack: 0.9066055683980557
Feature 231  job: 17.50618838975811
Feature 232  john: 0.3931862371754435
Feature 233  journey: 6.355230156744384
Feature 234  kid: 10.266186631277815
Feature 235  kill: 17.84881353310309
Feature 236  kind: 55.33127319877188
Feature 237  kindle: 15.847186459307935
Feature 238  know: 5.8761733344038465
Feature 239  lack: 202.51788579556668
Feature 240  language: 14.85871537108418
Feature 241  later: 9.508499011288354
Feature 242  laugh: 22.94335378449436
Feature 243  lead: 6.109237643459637
Feature 244  learn: 9.04479014794152
Feature 245  leave: 36.86955610558689
Feature 246  let: 11.470748957270839
Feature 247  level: 5.649413454175756
Feature 248  library: 17.825845212167454
Feature 249  life: 19.026379916512226
Feature 250  light: 37.54873454270926
Feature 251  like: 220.7866848533552
Feature 252  line: 23.777733455689884
Feature 253  list: 6.536082884067043
Feature 254  listen: 3.0562446831228858
Feature 255  little: 159.83974898353353
Feature 256  live: 12.06615416624612
Feature 257  long: 38.346086557106254
Feature 258  look: 7.825261305034279
Feature 259  lose: 22.042221944415086
Feature 260  lot: 32.640113280819996
Feature 261  love: 593.5232631018607
Feature 262  magic: 1.0028015100265435
Feature 263  main: 94.23939837508549
Feature 264  make: 20.843239880699997
Feature 265  man: 2.0754094580129814
Feature 266  matter: 4.749811698279922
Feature 267  maybe: 110.55105109934566
Feature 268  mean: 21.458561125665057
Feature 269  meet: 3.256706583578812
Feature 270  men: 1.6451787639014606
Feature 271  mention: 42.09574670160806
Feature 272  middle: 12.970291439089046
Feature 273  mind: 1.8720245174426242
Feature 274  miss: 6.814175245721961
Feature 275  mix: 1.9929920879148073
Feature 276  modern: 2.8721225676246336
Feature 277  money: 443.0838258709852
Feature 278  mother: 3.4760736226413527
Feature 279  movie: 0.5124530166830399
Feature 280  mr: 6.700101998201028
Feature 281  ms: 2.3783165745741934
Feature 282  murder: 12.770185464967339
Feature 283  mysteries: 5.0266170491047735
Feature 284  mystery: 10.548365822284602
Feature 285  need: 7.177054296974322
Feature 286  new: 0.5020211056311095
Feature 287  nice: 36.068669528055175
Feature 288  night: 5.220612667519281
Feature 289  nora: 4.834234784100506
Feature 290  note: 6.280051993583913
Feature 291  novel: 19.933500618107278
Feature 292  novels: 9.53536747972417
Feature 293  number: 2.739999210050063
Feature 294  offer: 3.4565173267873455
Feature 295  oh: 9.686438081021011
Feature 296  ok: 451.4919661385884
Feature 297  okay: 295.1471770039028
Feature 298  old: 5.837420294542503
Feature 299  ones: 1.0566075550855791
Feature 300  open: 1.2059029062783586
Feature 301  opinion: 34.00662828446518
Feature 302  order: 2.932917534828268
Feature 303  original: 7.33105691458293
Feature 304  overall: 130.70929360554035
Feature 305  pace: 20.093006912701895
Feature 306  page: 65.89579631825072
Feature 307  parent: 2.179225876611071
Feature 308  pass: 9.755016809552856
Feature 309  past: 9.840524746523911
Feature 310  pay: 53.13128549076389
Feature 311  people: 10.085791033394814
Feature 312  perfect: 41.47765826259362
Feature 313  person: 6.400384343576624
Feature 314  personal: 3.1466540161261634
Feature 315  perspective: 6.939832012927749
Feature 316  pick: 9.40956139649137
Feature 317  picture: 1.440827538359583
Feature 318  piece: 2.692844217879749
Feature 319  place: 14.739062024978688
Feature 320  plan: 1.659309193386911
Feature 321  play: 3.521618578448354
Feature 322  plot: 151.5642391057304
Feature 323  point: 67.46609470549589
Feature 324  political: 6.772084900124072
Feature 325  power: 0.9285154044795362
Feature 326  predictable: 208.96047736928867
Feature 327  present: 1.513378163758523
Feature 328  pretty: 134.6343932177447
Feature 329  previous: 48.566895323104156
Feature 330  price: 4.580722429408619
Feature 331  probably: 54.2777785793802
Feature 332  problem: 41.421654001612964
Feature 333  problems: 9.439357338027522
Feature 334  provide: 3.478308578663059
Feature 335  publish: 36.34457589628277
Feature 336  pull: 3.042883322689099
Feature 337  purchase: 12.568803889977596
Feature 338  quality: 10.49440421375283
Feature 339  question: 6.009928190990178
Feature 340  quick: 32.82035545898185
Feature 341  quickly: 2.355413036889932
Feature 342  quite: 69.71345934236605
Feature 343  rat: 25.856406391362196
Feature 344  reacher: 0.9980159500817469
Feature 345  read: 67.04635236542825
Feature 346  reader: 3.8847141421697264
Feature 347  readers: 7.509440700428964
Feature 348  real: 3.6459458057195144
Feature 349  realistic: 6.672731875586926
Feature 350  realize: 14.245295960959911
Feature 351  really: 22.585365146889057
Feature 352  reason: 48.15417526464343
Feature 353  receive: 1.996210427489986
Feature 354  recipes: 1.1574993060573722
Feature 355  recommend: 67.22660510746599
Feature 356  reference: 2.963495588842515
Feature 357  relate: 1.742983097059717
Feature 358  relationship: 8.828898457427103
Feature 359  relationships: 6.74624195821414
Feature 360  remember: 3.704935684124509
Feature 361  remind: 0.9715271521896603
Feature 362  require: 0.4240236676330219
Feature 363  reread: 13.367668038644378
Feature 364  research: 0.9974657645902327
Feature 365  rest: 8.361876644984301
Feature 366  return: 27.483225010395945
Feature 367  reveal: 4.574446253322842
Feature 368  review: 49.40911714193215
Feature 369  rich: 1.774304948063932
Feature 370  right: 3.0041563025476443
Feature 371  roberts: 2.5099127924107
Feature 372  romance: 6.3905747445388865
Feature 373  romantic: 3.6846914050436093
Feature 374  run: 7.967976033450664
Feature 375  sad: 2.336418598860409
Feature 376  satisfy: 3.54496233607807
Feature 377  save: 39.67766415051884
Feature 378  saw: 0.1300138528178174
Feature 379  say: 44.36389561478411
Feature 380  scenes: 24.287493973735327
Feature 381  school: 0.8297185242959743
Feature 382  science: 0.4894949761655719
Feature 383  second: 3.2142908153361684
Feature 384  sense: 19.39323892818728
Feature 385  series: 63.06996145150397
Feature 386  set: 10.348188844018837
Feature 387  sex: 93.76049114466163
Feature 388  share: 11.446693072668637
Feature 389  shes: 4.15895891094562
Feature 390  short: 40.74250634581348
Feature 391  simple: 1.8808154984739047
Feature 392  simply: 14.314910944347963
Feature 393  sister: 1.7892568108505984
Feature 394  sit: 1.172289103241123
Feature 395  slow: 147.13870031087555
Feature 396  small: 9.521126822839022
Feature 397  solve: 9.157845177160544
Feature 398  somewhat: 75.45127551821166
Feature 399  son: 10.621350900568622
Feature 400  soon: 2.5436091556850755
Feature 401  sort: 33.35025658032313
Feature 402  sound: 45.49462609591128
Feature 403  speak: 1.9139559795699737
Feature 404  spend: 51.82984716213818
Feature 405  stand: 2.7178118879752833
Feature 406  star: 84.7427073715615
Feature 407  start: 5.181520172678018
Feature 408  state: 2.8957725453480125
Feature 409  stay: 0.8549636819914425
Feature 410  stop: 16.583759845662332
Feature 411  stories: 3.1090205501716417
Feature 412  story: 42.06165476880934
Feature 413  storyline: 7.9084590552659
Feature 414  strong: 5.831795495528477
Feature 415  struggle: 6.767505515531607
Feature 416  study: 1.0231079419390925
Feature 417  stuff: 7.824445493920961
Feature 418  style: 22.35954918165135
Feature 419  subject: 3.373160332712664
Feature 420  summer: 6.48945890909576
Feature 421  sure: 34.046361813765465
Feature 422  surprise: 16.416317455526503
Feature 423  suspense: 8.66581801894244
Feature 424  sweet: 7.109459697008417
Feature 425  tale: 11.067104975838035
Feature 426  talk: 16.978438095633607
Feature 427  teach: 5.387067164654407
Feature 428  tell: 2.844186318310631
Feature 429  text: 3.687610485578472
Feature 430  th: 0.739498437560261
Feature 431  thank: 58.01822960864151
Feature 432  thats: 29.26008178231229
Feature 433  theme: 7.7265182043382366
Feature 434  theres: 9.75093922336736
Feature 435  thing: 39.229626212597466
Feature 436  things: 12.611506858605457
Feature 437  think: 97.54302180474065
Feature 438  thoroughly: 13.02460897804794
Feature 439  thrill: 4.401514308918216
Feature 440  thriller: 4.728934103948533
Feature 441  throw: 48.510084910212775
Feature 442  time: 32.496875584448695
Feature 443  title: 30.51094501908474
Feature 444  today: 5.897227879242434
Feature 445  totally: 13.488055281750023
Feature 446  touch: 11.554588265877193
Feature 447  town: 5.713339103229073
Feature 448  train: 5.679174289390794
Feature 449  travel: 3.4866641057052856
Feature 450  trilogy: 2.6888174749392766
Feature 451  trouble: 3.2299647377838676
Feature 452  true: 7.928984372944505
Feature 453  truly: 14.292483805878827
Feature 454  try: 106.09577828579195
Feature 455  turn: 6.923915020807808
Feature 456  turner: 9.950235039858251
Feature 457  twist: 23.43892245269411
Feature 458  type: 12.362327368771286
Feature 459  typical: 25.67255677092229
Feature 460  understand: 1.3311728335901925
Feature 461  unique: 7.777991966150199
Feature 462  use: 15.155076187859354
Feature 463  usual: 6.019483090459161
Feature 464  usually: 44.296866455154486
Feature 465  version: 13.936153420669447
Feature 466  view: 3.2786025566184107
Feature 467  visit: 2.1311193896532474
Feature 468  wait: 90.1229954412271
Feature 469  want: 3.6985573122856406
Feature 470  war: 4.109932647472349
Feature 471  wasnt: 177.57534055144825
Feature 472  waste: 1064.1764527885412
Feature 473  watch: 2.082506453866035
Feature 474  way: 19.030140242826615
Feature 475  ways: 2.278730123875187
Feature 476  weave: 10.211511775470962
Feature 477  white: 2.0851691941837553
Feature 478  wife: 2.8389601077631386
Feature 479  wish: 2.541512864282969
Feature 480  woman: 6.793596152370884
Feature 481  women: 3.0844224550759356
Feature 482  wonder: 6.272796413146895
Feature 483  wonderful: 154.89502083749963
Feature 484  wont: 10.290048910025304
Feature 485  word: 14.551764490557288
Feature 486  work: 6.728959657880534
Feature 487  world: 11.039125976693557
Feature 488  worth: 16.2242446376536
Feature 489  wow: 18.833289430002868
Feature 490  write: 29.711330364944214
Feature 491  writer: 2.730955782183786
Feature 492  writers: 4.984179415756183
Feature 493  wrong: 12.436032929002097
Feature 494  year: 11.197877518769074
Feature 495  years: 6.59474687612647
Feature 496  yes: 2.6071507283555553
Feature 497  youll: 1.6830519489548381
Feature 498  young: 1.442429996728022
Feature 499  youre: 2.5063124593887
Feature 500  board: 18.73894761361651
Feature 501  book: 18.652582444842214
Feature 502  edition: 89.50924210211903
Feature 503  hardcover: 29.951026169972266
Feature 504  kindle: 89.50924210211903
Feature 505  market: 12.978636287286268
Feature 506  mass: 12.978636287286268
Feature 507  paperback: 33.14730161702822
Feature 508  verified: 266.07145331429166

Save the results into a dataframe.

In [ ]:
results_df = pd.DataFrame(fscores, index=features_name , columns = ['importance'])
In [ ]:
results_df.head()
Out[ ]:
importance
able 1.021212
absolutely 40.960661
account 1.968647
action 9.882492
actually 16.370283

We can see the original results are not sorted.

Now we sort those values by descending.

In [ ]:
results_df = results_df.sort_values(by=['importance'],ascending=False)
results_df.head(10)
Out[ ]:
importance
waste 1064.176453
bore 879.534907
love 593.523263
great 550.203554
ok 451.491966
money 443.083826
didnt 381.508844
good 360.701074
okay 295.147177
disappoint 269.339554

It's clearly to see that waste, bore, love, great, ok, money and etc are most import features in our TFIDF vector space. Now, we plot those as bar plot.

In [ ]:
#---------words_importance_plot---------------------
def words_importance_plot(results, fig_size = (15,10)):
  fig, ax = plt.subplots(figsize = fig_size)
  results.plot.barh(ax=ax)
  plt.gca().invert_yaxis()
  ax.set_ylabel('Importance')
  ax.set_title("Barplot of words' importance")

First we take a look at the bar plot of first 50 importance words.

In [ ]:
words_importance_plot(results_df.head(50))

We can see that there could be more importance words.

Let's draw the bar plot of first 100 words.

In [ ]:
words_importance_plot(results_df.head(100),fig_size=(15,18))

250 words:

In [ ]:
words_importance_plot(results_df.head(250),  fig_size = (15,30))

Now, we can see that after a certain number of words, the importances of all words after that are not that important anymore.

Now we plot the boxplot of the result dataframe.

In [ ]:
#---------words_importance_plot---------------------
def words_importance_barplot(results, fig_size = (8,8)):
  fig, ax = plt.subplots(figsize = fig_size)
  results.boxplot(ax=ax)
  ax.set_ylabel('Importance')
  ax.set_title("Barplot of words' importance")
In [ ]:
words_importance_barplot(results_df)
In [ ]:
build_continuous_features_report(results_df)
Out[ ]:
Count Miss % Card. Min 1st Qrt. Mean Median 3rd Qrt Max Std. Dev.
importance 509 0.0 507 0.130014 3.042883 32.234327 7.908459 22.359549 1064.176453 87.110186

We can see that the mean importance is only 32.23, and max is 1064. We don't want those features with low importance.

In [ ]:
find_outliers(results_df,'importance')
IQR = 22.35954918165135 - 3.042883322689099 = 19.31666585896225
MAX = 51.33454797009472
Min is 0
Num of min outliers:  0
Num of max outliers:  67
Num of negative outliers:  0
Num of the original data set's whole instance 509
Rate of purged data/total data 0.13163064833005894
Out[ ]:
Index(['waste', 'bore', 'love', 'great', 'ok', 'money', 'didnt', 'good',
       'okay', 'disappoint', 'verified', 'bite', 'dont', 'like', 'bad',
       'predictable', 'lack', 'wasnt', 'little', 'wonderful', 'finish', 'plot',
       'excellent', 'slow', 'pretty', 'hop', 'overall', 'instead', 'felt',
       'highly', 'enjoy', 'maybe', 'amaze', 'care', 'try', 'think',
       'character', 'main', 'sex', 'wait', 'awesome', 'kindle', 'edition',
       'half', 'enjoyable', 'expect', 'star', 'book', 'end', 'confuse',
       'chapters', 'better', 'somewhat', 'quite', 'point', 'recommend', 'read',
       'page', 'series', 'thank', 'kind', 'fantastic', 'probably', 'pay',
       'doesnt', 'spend', 'fun'],
      dtype='object')

Let's redraw the plot with any importance value larger than 51. By our find outliers function we defined before.

In [ ]:
words_importance_barplot(results_df[results_df['importance'] > 51])

Let's see how many of those words that importance > 51.

In [ ]:
len(results_df[results_df['importance'] > 51])
Out[ ]:
67

Only 67 features. Let's redraw the bar plot.

In [ ]:
words_importance_plot((results_df[results_df['importance'] > 51]),fig_size=(15,15))

Intuitively, we can see that only a few words at front is important.

However, it doesn't mean everyone would use these words in their reviewText. If they didn't use any of words at top 10 or top 15, we have no clue what their attitude of the product is.

Hence, we must leave the most important features with some redundancy。

In the end, we will take those 67 features all, then create a new dataset.

In [ ]:
results_df[results_df['importance'] > 51].index
Out[ ]:
Index(['waste', 'bore', 'love', 'great', 'ok', 'money', 'didnt', 'good',
       'okay', 'disappoint', 'verified', 'bite', 'dont', 'like', 'bad',
       'predictable', 'lack', 'wasnt', 'little', 'wonderful', 'finish', 'plot',
       'excellent', 'slow', 'pretty', 'hop', 'overall', 'instead', 'felt',
       'highly', 'enjoy', 'maybe', 'amaze', 'care', 'try', 'think',
       'character', 'main', 'sex', 'wait', 'awesome', 'kindle', 'edition',
       'half', 'enjoyable', 'expect', 'star', 'book', 'end', 'confuse',
       'chapters', 'better', 'somewhat', 'quite', 'point', 'recommend', 'read',
       'page', 'series', 'thank', 'kind', 'fantastic', 'probably', 'pay',
       'doesnt', 'spend', 'fun'],
      dtype='object')
In [ ]:
model_data.head()
Out[ ]:
able absolutely account action actually add addition adult adventure age ... board book edition hardcover kindle market mass paperback verified score
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0 1 1
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.707107 0.0 0.707107 0.0 0.0 0.0 1 5
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 1.0 0 5
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 1.0 0.000000 0.0 0.0 0.0 0 4
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 1.0 0.000000 0.0 0.0 0.0 0 5

5 rows Ă— 510 columns

Now, we extract all those 67 features with our target label : score.

In [ ]:
reduced_model_data = model_data.loc[:,  results_df[results_df['importance'] > 51].index]
reduced_model_data['score'] = model_data['score']
reduced_model_data.head(5)
Out[ ]:
waste bore love great ok money didnt good okay disappoint ... series thank kind fantastic probably pay doesnt spend fun score
0 0.348458 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.252675 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1
1 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 5
2 0.000000 0.0 0.117233 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 5
3 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.159826 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 4
4 0.000000 0.0 0.270999 0.139326 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.000000 0.0 0.0 0.295925 0.0 0.0 0.0 0.0 0.0 5

5 rows Ă— 70 columns

That's what we need for our training process.

And let's save it to a new csv file. We will use this file since now.

In [ ]:
reduced_model_data.to_csv('reduced_model_data.csv')
In [ ]:
reduced_model_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 70 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   waste        100004 non-null  float64
 1   bore         100004 non-null  float64
 2   love         100004 non-null  float64
 3   great        100004 non-null  float64
 4   ok           100004 non-null  float64
 5   money        100004 non-null  float64
 6   didnt        100004 non-null  float64
 7   good         100004 non-null  float64
 8   okay         100004 non-null  float64
 9   disappoint   100004 non-null  float64
 10  verified     100004 non-null  Int64  
 11  bite         100004 non-null  float64
 12  dont         100004 non-null  float64
 13  like         100004 non-null  float64
 14  bad          100004 non-null  float64
 15  predictable  100004 non-null  float64
 16  lack         100004 non-null  float64
 17  wasnt        100004 non-null  float64
 18  little       100004 non-null  float64
 19  wonderful    100004 non-null  float64
 20  finish       100004 non-null  float64
 21  plot         100004 non-null  float64
 22  excellent    100004 non-null  float64
 23  slow         100004 non-null  float64
 24  pretty       100004 non-null  float64
 25  hop          100004 non-null  float64
 26  overall      100004 non-null  float64
 27  instead      100004 non-null  float64
 28  felt         100004 non-null  float64
 29  highly       100004 non-null  float64
 30  enjoy        100004 non-null  float64
 31  maybe        100004 non-null  float64
 32  amaze        100004 non-null  float64
 33  care         100004 non-null  float64
 34  try          100004 non-null  float64
 35  think        100004 non-null  float64
 36  character    100004 non-null  float64
 37  main         100004 non-null  float64
 38  sex          100004 non-null  float64
 39  wait         100004 non-null  float64
 40  awesome      100004 non-null  float64
 41  kindle       100004 non-null  float64
 42  kindle       100004 non-null  float64
 43  edition      100004 non-null  float64
 44  half         100004 non-null  float64
 45  enjoyable    100004 non-null  float64
 46  expect       100004 non-null  float64
 47  star         100004 non-null  float64
 48  book         100004 non-null  float64
 49  book         100004 non-null  float64
 50  end          100004 non-null  float64
 51  confuse      100004 non-null  float64
 52  chapters     100004 non-null  float64
 53  better       100004 non-null  float64
 54  somewhat     100004 non-null  float64
 55  quite        100004 non-null  float64
 56  point        100004 non-null  float64
 57  recommend    100004 non-null  float64
 58  read         100004 non-null  float64
 59  page         100004 non-null  float64
 60  series       100004 non-null  float64
 61  thank        100004 non-null  float64
 62  kind         100004 non-null  float64
 63  fantastic    100004 non-null  float64
 64  probably     100004 non-null  float64
 65  pay          100004 non-null  float64
 66  doesnt       100004 non-null  float64
 67  spend        100004 non-null  float64
 68  fun          100004 non-null  float64
 69  score        100004 non-null  Int64  
dtypes: Int64(2), float64(68)
memory usage: 53.6 MB
In [ ]:
reduced_model_data = pd.read_csv('/content/drive/MyDrive/A3/reduced_model_data.csv')
reduced_model_data = reduced_model_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
reduced_model_data = reduced_model_data.convert_dtypes()
reduced_model_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 70 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   waste        100004 non-null  Float64
 1   bore         100004 non-null  Float64
 2   love         100004 non-null  Float64
 3   great        100004 non-null  Float64
 4   ok           100004 non-null  Float64
 5   money        100004 non-null  Float64
 6   didnt        100004 non-null  Float64
 7   good         100004 non-null  Float64
 8   okay         100004 non-null  Float64
 9   disappoint   100004 non-null  Float64
 10  verified     100004 non-null  Int64  
 11  bite         100004 non-null  Float64
 12  dont         100004 non-null  Float64
 13  like         100004 non-null  Float64
 14  bad          100004 non-null  Float64
 15  predictable  100004 non-null  Float64
 16  lack         100004 non-null  Float64
 17  wasnt        100004 non-null  Float64
 18  little       100004 non-null  Float64
 19  wonderful    100004 non-null  Float64
 20  finish       100004 non-null  Float64
 21  plot         100004 non-null  Float64
 22  excellent    100004 non-null  Float64
 23  slow         100004 non-null  Float64
 24  pretty       100004 non-null  Float64
 25  hop          100004 non-null  Float64
 26  overall      100004 non-null  Float64
 27  instead      100004 non-null  Float64
 28  felt         100004 non-null  Float64
 29  highly       100004 non-null  Float64
 30  enjoy        100004 non-null  Float64
 31  maybe        100004 non-null  Float64
 32  amaze        100004 non-null  Float64
 33  care         100004 non-null  Float64
 34  try          100004 non-null  Float64
 35  think        100004 non-null  Float64
 36  character    100004 non-null  Float64
 37  main         100004 non-null  Float64
 38  sex          100004 non-null  Float64
 39  wait         100004 non-null  Float64
 40  awesome      100004 non-null  Float64
 41  kindle       100004 non-null  Float64
 42  kindle.1     100004 non-null  Float64
 43  edition      100004 non-null  Float64
 44  half         100004 non-null  Float64
 45  enjoyable    100004 non-null  Float64
 46  expect       100004 non-null  Float64
 47  star         100004 non-null  Float64
 48  book         100004 non-null  Float64
 49  book.1       100004 non-null  Float64
 50  end          100004 non-null  Float64
 51  confuse      100004 non-null  Float64
 52  chapters     100004 non-null  Float64
 53  better       100004 non-null  Float64
 54  somewhat     100004 non-null  Float64
 55  quite        100004 non-null  Float64
 56  point        100004 non-null  Float64
 57  recommend    100004 non-null  Float64
 58  read         100004 non-null  Float64
 59  page         100004 non-null  Float64
 60  series       100004 non-null  Float64
 61  thank        100004 non-null  Float64
 62  kind         100004 non-null  Float64
 63  fantastic    100004 non-null  Float64
 64  probably     100004 non-null  Float64
 65  pay          100004 non-null  Float64
 66  doesnt       100004 non-null  Float64
 67  spend        100004 non-null  Float64
 68  fun          100004 non-null  Float64
 69  score        100004 non-null  Int64  
dtypes: Float64(68), Int64(2)
memory usage: 60.1 MB

We can see that our csv can remaint the same type of our data.

Then that's the end of our feature selection.

3.3. Select the evaluation metric/metrics. Justify your choice.

In this assignment, we will still mainly use accuracy as our main metric. Since we have a classification task to solve.

And instead of just using accuracy, we will still save the prediction and use sklearn's classification_report to show all the details about each feature's percision score, recall score and f1 score at the end.

But we will not plot those scores in section 3.

3.4. Perform hyperparameter tuning if applicable.

Now we start to split the train and test set from our new feature selected dataset.

In [ ]:
reduced_model_data = pd.read_csv('/content/drive/MyDrive/A3/reduced_model_data.csv')
reduced_model_data = reduced_model_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
reduced_model_data = reduced_model_data.convert_dtypes()
In [ ]:
reduced_model_data.head()
Out[ ]:
waste bore love great ok money didnt good okay disappoint ... series thank kind fantastic probably pay doesnt spend fun score
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.18136 0.0 0.0 ... 0.0 0.0 0.35844 0.0 0.0 0.0 0.0 0.0 0.0 3
1 0.0 0.0 0.0 0.0 0.157539 0.0 0.0 0.0 0.0 0.0 ... 0.088223 0.0 0.0 0.0 0.0 0.0 0.139076 0.0 0.0 3
2 0.0 0.0 0.217624 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5
3 0.0 0.0 0.174643 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.218499 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5
4 0.0 0.0 0.181436 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5

5 rows Ă— 70 columns

Now we separate our dataset as training set, validation set and test set.

In [ ]:
#-----------get_model_set-------------
def get_model_set(data):
  # get our data set into features and labels
  X = data.iloc[:,:-1]
  y = data.iloc[:,-1:].values.ravel()
  y = y.astype(int)
  return X, y

We get our data as X and y as label.

We will use stratified sampling later.

In [ ]:
X_raw, y_raw = get_model_set(reduced_model_data)
In [ ]:
y_raw
Out[ ]:
array([3, 3, 5, ..., 3, 5, 5])

Split our data by stratified sampling by using stratify = y

And we split our set into training, validation and test set.

In [ ]:
X_train,X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)
X_train,X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

Test train for once

In [ ]:
from sklearn.ensemble import RandomForestClassifier
forest_cls = RandomForestClassifier(n_estimators = 50, random_state=42,verbose=1)
forest_cls.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   11.0s finished
Out[ ]:
RandomForestClassifier(n_estimators=50, random_state=42, verbose=1)

Now we predict our values on validation set.

In [ ]:
from sklearn.metrics import f1_score
churn_prediction = forest_cls.predict(X_valid)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.5s finished
In [ ]:
from sklearn.metrics import classification_report

target_names = ['1', '2','3','4','5']

print("Classification report of the first classifier:\n\n",
      classification_report(y_valid, churn_prediction, target_names=target_names))
Classification report of the first classifier:

               precision    recall  f1-score   support

           1       0.34      0.15      0.21       467
           2       0.25      0.08      0.12       579
           3       0.34      0.18      0.24      1304
           4       0.35      0.25      0.29      3024
           5       0.70      0.88      0.78      8780

    accuracy                           0.62     14154
   macro avg       0.40      0.31      0.33     14154
weighted avg       0.57      0.62      0.58     14154

Results are not quite good.

Let's use random Search directly this time.

  1. we know that the default max_features of random foreset is $\sqrt{q}$ and q as the number of total features. (Dalhousie STAT 3450)

So we make our max_feature in random search = $\sqrt{67} = 8.18 $. But we will truncate 8.18 to 8, since this is what R will do by default.

In [ ]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high = 100),
        'max_features': randint(low=1, high = 8),
    }

forest_cls_rs = RandomForestClassifier(random_state=42, verbose = 1)
rnd_search = RandomizedSearchCV(forest_cls_rs, param_distributions=param_distribs,
                                n_iter=10, cv=5, random_state=42)
rnd_search.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:   10.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    1.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    9.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    6.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    6.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    6.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    1.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    1.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    1.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    1.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    1.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    6.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    6.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    8.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    8.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    5.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   10.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   10.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   10.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   10.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   10.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    9.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    9.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:   10.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    9.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:   10.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    7.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    7.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    7.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   11.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    1.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   13.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    2.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    2.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    2.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    2.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    2.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    1.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    1.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    1.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    1.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    1.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    8.1s finished
Out[ ]:
RandomizedSearchCV(cv=5,
                   estimator=RandomForestClassifier(random_state=42, verbose=1),
                   param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fa9bb936090>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fa9bb965fd0>},
                   random_state=42)
In [ ]:
rnd_search.best_params_
Out[ ]:
{'max_features': 3, 'n_estimators': 72}
In [ ]:
joblib.dump(rnd_search, 'rnd_search.pkl')
Out[ ]:
['rnd_search.pkl']

Best estimator of our random search is:

In [ ]:
rnd_search.best_estimator_
Out[ ]:
RandomForestClassifier(max_features=3, n_estimators=72, random_state=42,
                       verbose=1)

Print the list of we tried and sorted the order

In [ ]:
cvres_rnd = rnd_search.cv_results_
for mean_score, params in sorted(zip(cvres_rnd["mean_test_score"], cvres_rnd["params"]), reverse=True):
    print(mean_score, params)
0.6268590725266622 {'max_features': 3, 'n_estimators': 72}
0.6262231533994889 {'max_features': 7, 'n_estimators': 83}
0.6259759065526608 {'max_features': 3, 'n_estimators': 88}
0.6256932428945665 {'max_features': 7, 'n_estimators': 75}
0.6241565149108879 {'max_features': 7, 'n_estimators': 52}
0.618928121798386 {'max_features': 5, 'n_estimators': 24}
0.6181156193987963 {'max_features': 3, 'n_estimators': 22}
0.6172677579106971 {'max_features': 5, 'n_estimators': 21}
0.613063870401616 {'max_features': 5, 'n_estimators': 15}
0.46968946545498175 {'max_features': 5, 'n_estimators': 2}

We can see that our best estimator is using RSV: bootstrap with 3 random features among all trees and use totally 72 trees.

3.5. Train and evaluate your model.

The work is not done by one day, so we reload our dataset and divide into train,val and test set.

In [ ]:
reduced_model_data = pd.read_csv('/content/drive/MyDrive/A3/reduced_model_data.csv')
reduced_model_data = reduced_model_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
reduced_model_data = reduced_model_data.convert_dtypes()
In [ ]:
X_raw, y_raw = get_model_set(reduced_model_data)
X_train,X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)
X_train,X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

Now we can train our model with above hyperparamters and test it on validation set.

In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Train our model
forest_cls_final = RandomForestClassifier(max_features=3, n_estimators=72, random_state=42)
# train our model
forest_cls_final.fit(X_train, y_train)
pred_final_valid = forest_cls_final.predict(X_valid)

target_names = ['1', '2','3','4','5']

print("Classification report of the final classifier on validation set:\n\n",
      classification_report(y_valid, pred_final_valid, target_names=target_names))
Classification report of the final classifier on validation set:

               precision    recall  f1-score   support

           1       0.39      0.15      0.22       526
           2       0.23      0.06      0.09       656
           3       0.33      0.17      0.22      1467
           4       0.37      0.22      0.28      3414
           5       0.69      0.90      0.78      9938

    accuracy                           0.63     16001
   macro avg       0.40      0.30      0.32     16001
weighted avg       0.56      0.63      0.58     16001

Cross validation on the training set.

In [ ]:
from sklearn.model_selection import cross_val_score
original_result = cross_val_score(forest_cls_final, X_train, y_train, cv=10)
original_result
Out[ ]:
array([0.62990158, 0.63162006, 0.623125  , 0.62140625, 0.62265625,
       0.61984375, 0.6246875 , 0.61671875, 0.62640625, 0.62375   ])

We will use this result in Q4 for comparing with Q4's result.

Test it on test set.

In [ ]:
# test it on test set
prediction_final_test = forest_cls_final.predict(X_test)
In [ ]:
print("Classification report of the final classifier on validation set:\n\n",
      classification_report(y_test, prediction_final_test, target_names=target_names))
Classification report of the final classifier on validation set:

               precision    recall  f1-score   support

           1       0.37      0.16      0.22       658
           2       0.24      0.07      0.11       819
           3       0.36      0.16      0.22      1834
           4       0.37      0.23      0.28      4267
           5       0.69      0.89      0.78     12423

    accuracy                           0.62     20001
   macro avg       0.41      0.30      0.32     20001
weighted avg       0.56      0.62      0.58     20001

The best accuracy that I can get with this dataset is 0.62.

3.6. How do you make sure not to overfit?

We can keep increasing the number of features we have. But we will not do that yet. Since we will do part of speech tagging in section 4.

We will keep going from there.

And We can draw the learning curve and validation curve of our model to see whether the training score and validation score has departed.

For learning curve:

Since we are using the accuracy evaluation metrics, hence, if the valiation score become very larger and training score still small, then we know our model is overfitted.

We can based on our results to choose our train set size or number of trees in our model.

For validation curve: Although randomforest algroithm is not very easy to overfitted. We still can tune the hyperparameters to make sure it will not overfit. By looking into the validation curve, if along with the grow of number of trees, the train/loss curve has departed from each other. We know that our model become overfitted at that number of trees.

Use these two tools, we can make sure our model will not overfit our data.

3.7. Plot a visualization of the learning process or the learned information of the model.

3.7.1 Learning curve

A learning curve is a graphical representation of the relationship between how proficient people are at a task and the amount of experience they have. Proficiency (measured on the vertical axis) usually increases with increased experience (the horizontal axis), that is to say, the more someone, groups, companies or industries perform a task, the better their performance at the task.Wikipedia

We can find out from the LR curve that whether our model is overfitting/underfitting or whether our model could still imporve by more training epoches. It will help identify our model is good or bad.

set the model as our best estimator and finding the learning curve with scoring = accuracy. Sklearn document

In [ ]:
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# set the model as our best estimator and finding the learning curve with scoring = accuracy
N, train_lc, val_lc = learning_curve(RandomForestClassifier(max_features=3, n_estimators=72, random_state=42),
                                     X_train, y_train, cv=5,scoring='accuracy',
                                      train_sizes=np.linspace(0.05, 1, 20)) # separate our training size by np.linspace, to 20 pieces
In [ ]:
joblib.dump(N,'N.pkl')
joblib.dump(train_lc,'train_lc.pkl')
joblib.dump(val_lc,'val_lc.pkl')
Out[ ]:
['val_lc.pkl']
In [ ]:
N = joblib.load('/content/drive/MyDrive/A3/pkls/N.pkl')
train_lc = joblib.load('/content/drive/MyDrive/A3/pkls/train_lc.pkl')
val_lc = joblib.load('/content/drive/MyDrive/A3/pkls/val_lc.pkl')

Now we define a function to plot the learning curve.

In [ ]:
#------------------------learning_curve------------------
def learning_curve(N, train_lc, val_lc):
  # set the figure size
  fig, ax = plt.subplots(figsize=(16, 6))
  # get the training score
  ax.plot(N, np.mean(train_lc, 1), color='blue', label='training score')
  # get the validation score
  ax.plot(N, np.mean(val_lc, 1), color='red', label='validation score')
  # draw the grid line
  ax.hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1],
                color='gray', linestyle='dashed')
  # graph setting up
  ax.set_ylim(0.5, 1.2)
  ax.set_xlim(N[0], N[-1])
  ax.set_xlabel('training size')
  ax.set_ylabel('Accuracy')
  ax.set_title("Random forest Accuracy Train/Valid of our final model")
  ax.legend(loc='best')
  fig.show()
In [ ]:
learning_curve(N, train_lc, val_lc)

3.7.2 Validation curve on different trees:

Although we used random search and grid search for finding the most proper hyperparameters, we still want to know that how many trees is good enough and how our model is influnenced by the different number of trees. Like what we did in polynomial regression with different degrees.

In this kind of visualization, we can give a intuitive way of showing the chaning of train/valid score by trees number.

This can help up to use a relatively good enough number of trees when we test our model on different dataset. Since if we best estimator only has 22 trees, we set our limit to 50 trees in this plot.

In [ ]:
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
import numpy as np

    
n_estimators = np.arange(1, 50) # limit of number of estimators
train_score, val_score = validation_curve(forest_cls_final, X_train, y_train,
                                          param_name='n_estimators', param_range=n_estimators, cv=2
                                          , scoring = 'accuracy')
In [ ]:
joblib.dump(train_score,'est_train_score.pkl')
joblib.dump(val_score,'est_val_score.pkl')
Out[ ]:
['est_val_score.pkl']
In [ ]:
train_score = joblib.load('/content/drive/MyDrive/A3/pkls/est_train_score.pkl')
val_score = joblib.load('/content/drive/MyDrive/A3/pkls/est_val_score.pkl')
In [ ]:
#------------------------valid_score_curve------------------
def valid_score_curve(train_score, val_score, n_estimators = np.arange(1, 50)):
  fig, ax = plt.subplots(figsize=(16, 6))
  # get mean of 5 cv of values
  ax.plot(n_estimators, np.median(train_score, 1), color='blue', label='training score')
  ax.plot(n_estimators, np.median(val_score, 1), color='red', label='validation score')
  # matplot setting
  ax.legend(loc='best')
  ax.set_ylim(0.1, 1.2)
  ax.set_xlim(0, 50)
  ax.set_title("Train/Valid ACCURACY loss of different random forest models")
  ax.set_xlabel('number of trees')
  ax.set_ylabel('ACCURACY');
  plt.show()
In [ ]:
# draw the validation curve
valid_score_curve(train_score, val_score, n_estimators = np.arange(1, 50))

3.7.3 ROC curve and percision recall curve.

  1. AUC-ROC curve is reall good on analyzing classification problem in machine learning. AUC stands for area under the curve and ROC stands for receiver operating characteristics curve. The more AUC area or value is, the better the model we have in classification problem.
  2. And precision - recall curve sepecially good at when we have a unbalanced dataset.

Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned.

The precision-recall curve shows the tradeoff between precision and recall for different threshold. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall). sklearn

The function is coming from scikitplot pacakge

This function will draw ROC and percision recall curve

In [ ]:
import matplotlib.pyplot as plt
import scikitplot as skplt
#------------draw_roc_curve--------
# this function will draw ROC and percision recall curve
def draw_roc_or_percision_recall_curve(model,y_test, X_test, type = 'roc'):
  predicted_probas = model.predict_proba(X_test) # get results
  fig, ax = plt.subplots(figsize = (10,10))
  if type == 'roc':
    skplt.metrics.plot_roc(y_test, predicted_probas, ax= ax)  # draw ROC curve
  else: # draw precision_recall curve
    skplt.metrics.plot_precision_recall_curve(y_test, predicted_probas,ax=ax)
  plt.show()
  return 

ROC Curve

In [ ]:
# draw ROC curve
draw_roc_or_percision_recall_curve(forest_cls_final, y_test, X_test, type='roc')

Percision Recall curve

In [ ]:
# draw Percision Recall curve
draw_roc_or_percision_recall_curve(forest_cls_final, y_test, X_test, type='pr_c')
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_precision_recall_curve is deprecated; This will be removed in v0.5.0. Please use scikitplot.metrics.plot_precision_recall instead.
  warnings.warn(msg, category=FutureWarning)

3.8. Analyze the results.

3.8.1 Learning Curve and validation curve.

This time, we first we take a look at our validation curve. Since our accuracy is too low, we would like to know if there is a overfitting by using too many trees.

In [ ]:
# draw the validation curve
valid_score_curve(train_score, val_score, n_estimators = np.arange(1, 50))

However, in the validation curve, we can see that our model won't improve after more than 5 trees. Hence the choice of our best estimators won't be a problem of overfitting by large estimators' number. And random forest is famous by not easy to get overfitting by increasing estimators.

However, if our model can fit this data with those few trees. It means our data set has too few variability. We are impossible to find more patterns in the dataset. That is a bad sign.

  1. Then learning curve
In [ ]:
learning_curve(N, train_lc, val_lc)

And the learning curve confirmed our guessing.

We can see that our model is extremely easy to get overfitted in training model. And it gets near 100% with a very small training size.

That means the words used in validation set doesn't appear in training set. I tried to use all 500 words or 5000 words to train the model, but the result is similar. Since most of words doesn't appear in most of reviewText, it's impossible for our model to asign a correct coef to each parameteres.

Hence, we need to get our words feature as a small group and most importantly they must be important words.

3.8.2 ROC curve

Let's redraw the graph.

In [ ]:
# draw ROC curve
draw_roc_or_percision_recall_curve(forest_cls_final, y_test, X_test, type='roc')

We can see the Area under curve as AUC in our dataset is similar. But class 4 has lowest AUC in our ROC graph.

We would say that class 4 has the worst performance in our prediciton by ROC curve.

3.8.3 Percision-Recall graph

In [ ]:
print("Classification report of the final classifier on validation set:\n\n",
      classification_report(y_test, prediction_final_test, target_names=target_names))
Classification report of the final classifier on validation set:

               precision    recall  f1-score   support

           1       0.37      0.16      0.22       658
           2       0.24      0.07      0.11       819
           3       0.36      0.16      0.22      1834
           4       0.37      0.23      0.28      4267
           5       0.69      0.89      0.78     12423

    accuracy                           0.62     20001
   macro avg       0.41      0.30      0.32     20001
weighted avg       0.56      0.62      0.58     20001

We use our previous defined function to draw this graph.

In [ ]:
# draw ROC curve
draw_roc_or_percision_recall_curve(forest_cls_final, y_test, X_test, type='pr_c')
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_precision_recall_curve is deprecated; This will be removed in v0.5.0. Please use scikitplot.metrics.plot_precision_recall instead.
  warnings.warn(msg, category=FutureWarning)

We can see that except class 5 has a high percision score when recall score increases. Other four classes has a siginificant low percision scores. There is a chance that people who rate the product in lowe scores won't use similar words to express their feelings.

We would see in the original dataset, the prediciton is good on score 5's class. But not good at other classes.

The following is the way of how to calculate all kinds of scores. Wiki

image.png

Task 4. Perform part-of-speech tagging (0.35)

Utility Function

In this task, we need to redefine a few things in our transformer and pipelines for more easier to get our dataset separately. But it's not necessary to write those explictly.

Hence, we put those functions in this Utility Function section for more convenient to use.

In [ ]:
#------------- main transformer ---------------------
# Class for attribute transformer
# import important libray
from sklearn.base import BaseEstimator, TransformerMixin

class combined_attribute_adder_and_cleaner(BaseEstimator, TransformerMixin):
    '''data clean transfomer class'''
    
    def __init__(self, data_cleaner = True, servies_remainer = False, normalization = True): # no *args or **kargs
        # we need to set extra var to ensure do we need to purge the dataset. 
        # In my following experments, sometimes we don't need to do so. 
        self.data_cleaner = data_cleaner
        self.servies_remainer = servies_remainer
        self.normalization = normalization

    def fit(self, X, y=None):
        return self # nothing else to do

    def transform(self, data_df):
        # we first copy the data from our dataset.
        # operate on original data set sometimes is dangerous.
        X = data_df.copy()

        #0. drop NaN values
        # drop vote and image
        X = X.drop(['vote','image'], axis = 1)
        # drop NaN values
        for i in range(len(X.columns)):
          X = X.drop(X[X[str(X.columns[i])].isna()].index)

        # 1. First we change the feature verified with to integer
        X["verified"] = X["verified"].astype(int)

        # 2. purge outliers
        X = purge_outliers(X)

        # 3. drop all useless features and categorical features we alreayd transfered
        X = X.drop(['reviewerID','reviewTime', 'asin', 'unixReviewTime'],axis=1) 

        # 4. delete HTML tag and other useless characters
        X = clean_useless_information(X)

        # 5. clean alphanumeric data
        X['style'] = X['style'].str.replace('Format', '')

        # get text feature
        feature = X.select_dtypes(exclude="number").columns

        for i in range(len(feature)):
            print("Now it's removing number and alphanumberic from ", feature[i])
            # remove stop words
            # first change all character to lower case
            X[feature[i]] = X[feature[i]].str.replace('[^\w\s]+', '')
            X[feature[i]] = X[feature[i]].str.replace('[0-9]+', '')

        # remove stop words
        stop_words = stopwords.words('english')

        for i in range(len(feature)):
          print("Now it's removing stop words from ", feature[i])
          # remove stop words
          # first change all character to lower case
          X[feature[i]] = X[feature[i]].str.lower()
          X[feature[i]] = X[feature[i]].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

        # create new column
        X['text'] = X['summary'] + " " + X['reviewText']

        #6. clean style's space
        X['style'] = X['style'].str.replace(' ', '')
        
        # we put our target value at the end
        target = X.pop('overall')
        X['score'] = target


        return X
#############################PIPE LINE###################################################



# Now we build a transformer to get all the above steps
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# convert_pipeline is for create a whole pipeline but remain the dataFrame structure
convert_pipeline = Pipeline([
        ('attribs_adder_cleaner', combined_attribute_adder_and_cleaner(data_cleaner=True)),
    ])

4.1 Perform Part-of-Speech tagging.

4.1.0. Question classification

By newest classification from our TA. We can use only two datasets for task 4.

Here, I choose to compare dataset, 1 and 3.

Such that, Q3's result: as dataset 1: preprocessed. And dataset 3, POS tagged after preprocessing.

image.png

4.1.1 Perform part-of-speech on preprocessed dataset after Q2.

Get our already preprocessed data from before.

In [ ]:
lemmatized_text_data = pd.read_csv('/content/drive/MyDrive/A3/lemmatized_text_data.csv')
lemmatized_text_data = lemmatized_text_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
lemmatized_text_data = lemmatized_text_data.convert_dtypes()
lemmatized_text_data = lemmatized_text_data.fillna('empty')
lemmatized_text_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884604 entries, 0 to 884603
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   verified      884604 non-null  Int64 
 1   style         884604 non-null  string
 2   reviewerName  884604 non-null  string
 3   reviewText    884604 non-null  string
 4   summary       884604 non-null  string
 5   score         884604 non-null  Int64 
 6   text          884604 non-null  string
 7   token_text    884604 non-null  string
dtypes: Int64(2), string(6)
memory usage: 55.7 MB

Separate the 880K dataset into 100K dataset.

In [ ]:
from sklearn.model_selection import train_test_split
model_data_raw, _ = train_test_split(lemmatized_text_data, test_size=0.88695, random_state=42)
# we drop reviewText and reivewName, summary and text columns, since we already have token_text and review Name is useless.
model_data_raw = model_data_raw.drop(['reviewerName', 'reviewText', 'summary'], axis = 1)
# rearange the target label feature score to the end
model_data_raw['score'] = model_data_raw.pop('score')
In [ ]:
model_data_raw.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100004 entries, 744031 to 121958
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   verified    100004 non-null  Int64 
 1   style       100004 non-null  string
 2   text        100004 non-null  string
 3   token_text  100004 non-null  string
 4   score       100004 non-null  Int64 
dtypes: Int64(2), string(3)
memory usage: 4.8 MB
In [ ]:
model_data_raw.head()
Out[ ]:
verified style text token_text score
744031 1 kindle edition sigh miss claire jamie vague characters wasted... ['sigh', 'miss', 'claire', 'jamie', 'vague', '... 1
801184 1 kindle edition remarkable book many levels thoroughly enjoyed... ['remarkable', 'book', 'many', 'level', 'thoro... 5
341256 0 paperback buy book laurel hardy fans love book certainly... ['buy', 'book', 'laurel', 'hardy', 'fan', 'lov... 5
734969 0 hardcover enthralling suspense readers familiar ms coult... ['enthral', 'suspense', 'readers', 'familiar',... 4
750145 0 hardcover great story fantastic illustrations cupidandps... ['great', 'story', 'fantastic', 'illustrations... 5

Load Part-of_speech packages

In [ ]:
from nltk import word_tokenize, pos_tag, pos_tag_sents

Get all text from the column text

In [ ]:
texts = model_data_raw['text'].tolist()
In [ ]:
texts
Out[ ]:
['sigh miss claire jamie vague characters wasted minds readers clouded mthis book like overcast summer holiday like dollar store gift wrap like solicitation vs royalty check could reader ore disappointed',
 'remarkable book many levels thoroughly enjoyed book knowing boy title  year old passed away afraid would find depressing even though gone boy catalyst occurs book obsession lists world records motivates ona  year old woman helping chores set goals find meaning life parents equally motivated example especially father never really took time know son alive comes admire find meaning life result discoveries',
 'buy book laurel hardy fans love book certainly warmth affection john mccabe stan ollie comes every page focus laurel hardy comes well chapters devoted himi wish pictures text made dont surprised find whistling cukoo song reading book boys gone laughter created still us unlike current crop comedians',
 'enthralling suspense readers familiar ms coulters fbi couple savich sherlock pleased husband wife team makes reappearance cleverly scripted novel father michael joseph found murdered confessional san francisco identical twin brother fbi agent dane carver wants answers help homeless woman nick jones inspector delion san francisco pd dane searches killer apparently confessed murderous sins deceased priest nick discovers killer parroting television show consultant feds entangle investigation complete car chases nerdy tv studio execs complexity read enhanced underlying mystery surrounding true identity nick jones wellspoken homeless woman obviously running away someone something target witnessed father michael josephs murder someone running kill dane seeks answers keeps nick side sparks fly two dual compelling mysteries enhance anticipation creative version whodone slight disappointment near novels conclusion occurs completely separate resolution two intertwined tales fans ms coulters savich sherlock series captivated readers new series likely intrigued well',
 'great story fantastic illustrations cupidandpsycheesque story brave heroine must prove faith one loves p j lynchs watercolor illustrations incredible loved book little still enjoy today',
 'great campy scifi fun s first read book kid s old musty seemed thick brick paired sequel worlds collide great story planet space hurtling towards earth way thwart like would today discovered new planet bump earth cold space take old earth orbit plan hatched launch jules verne style rocket ship new planet story deals building staffing ship big day earth gets rightenglish side pocket great story fun read dated glory sequel worlds collide deals adam eve story new planet convenient lead lady called eve story good first book still fun read   highly recommended',
 'splendid reference reference used conjunction microsoft excel  step step large fonts large color diagrams references numbered pointed thus little problem locating anything author referencing printer would develop books following course comprehensive would also include library thus book microsoft excel  step step microsoft inside two later black white illustrations exercises chapter publisher provides pdf format charge enables printing various portions feel three books invaluable library',
 'true southerners clear expression southern soul clear truth familiar many ways thoughtprovoking others',
 'ok book quick read really enjoy plot arranged marriages two characters secret woven story compelling made book interesting would without best thing two female characters portrayed strong independent women reading books like one would swear indian women unhappy marriages indian men abusive controlling kind thing reflect flaw writing story guess cant rate strong characters strong plot make worth reading',
 'ok ok',
 'fun really cool book found many things didnt know would definitely read another book like',
 'solid follow second book series stands way first mr cole certainly found voice pacing lot even buildup done brilliantly internal struggles came fore book battle scenes done breakneck pace brilliant book',
 'couldnt get enough cant wait loved book think reader someone enjoys legal thrillersdramas youll enjoy story already downloaded next  books series anxious start themthe main character joe dillard defense attorney looking  innocent client leaves practice law comes across many cases terrible clients never believing innocence meets angel christian woman accused murdering preacherthe story unfolds realistic ways found engrossed story predicted parts outcome right wrong others hopefully keeps guessing',
 'terrific read great book interest knowledge istanbul modern turkey read fascinating wellwritten devoured quickly real pleasure read even think interest subject',
 'book cracked funny book im going get first book hope good better couldnt put book',
 'five stars awesome fast delivery',
 'good book good book great photos',
 'still writing great story another good story joe picket series best work bad either still entertaining disappointing',
 'fantastic lora leigh always delivers best books dont know fantastic read books seriesbreeds keep getting better already cant wait next book thank much lora leigh writing another fantastic addictive book recommend book everyone disappointed',
 'great book love books earlene fowlers benni harper mysteries fantastic hope writes would right read',
 'tour time france book combines history mapmaking exploration linguistics ethnography real storytellers flair history never neat tend think grubby filled discomforts strange often hostile people reading book gives armchair tour centuries migration rural life gaul without fleas mud hunger couldnt put',
 'sad also eye opener author describes eight poor families going pay rents author points evictions use rare occurs frequently today rents higher people little live properties often disrepair yet rents keep going higher sad book one read discusses topic often talked openly yet need find solution toi received book free blogging books exchange honest opinion',
 'another case solved like characters easy read cat youll love diesel twist turns ever read nancy drew hardy boys like one',
 'interesting book interesting look lives astor family quietly took one history making history come alive',
 'great read ive read c j box picket game warden stories always good read especially become familiar characters',
 'couldnt put riveting tale twists turns keep guessing end read book one day glad started morning else would lost sleep interesting believable characters story moves along good pace love references interjection humor',
 'five stars great character development terrific world building',
 'nice long book covers boston tea party battles lexington concord searching work historical fiction recreates boston late colonial period try johnny tremain classic reason settings characters richly described totally believableits entertaining story children remember several moving speeches mankinds thirst libertyit newbery  like newberys seems push envelope noticed three things beginning book heavy gods punishment falling johnny payment sin pride drum beaten over sons liberty hancock revere adams gang shown deliberately provoking british outrageous acts order bring desired end war plenty language family shy away repeating examples sluts hellforleather plenty swearingtogod damned yankee many times damned many times hell interjection anyway read aloud kids edit outupdate binding book softcover pictured disintegrated second time book bad',
 'five stars surprisingly complex storyline hope none reviews given away endings',
 'series keeps getting better jane yellowrock shapeshifter sometime vampire killer vampire protector continues evolve person magicalspiritual creature never read prolific writer manages sustain deep interest long series know series another gem dont encourage start book one best possible beach escapist fantasy possible gritty authentic funny tender amazing writing storytelling',
 'exciting book exciting way characters lot fun plot surprised every turn',
 'three stars difficult follow subjects jump around great deal last chapters excellent',
 'book review book really cute always gets good laugh daughter read together would highly recommend book interactive imaginative',
 'cute temping hope tarr decent book seem ms tarr really give reader fresh material young innocent girl hadsome rich man older woman able turn young innocent lady three months course evil cousin story told many times beforethis book could alot better ms tarr done plot',
 'great great series would recommend anyone interested dystopian stories',
 'five stars beautiful',
 'brash rude sometimes funny pretentious martin amis seemingly one folks simply smart good intelligence percolates novels including first novel rachel papers prose well done feels overcooked supposed narration oversexed pompous brat yet vocabulary used would confound mensa members intellectual arrogance wears thin awhile despite humorous touching moments overshadows goodthe rachel papers rather snotty nineteen year old trying get oxford university circa  strange reason documents main characters life including love prospects walks rachel somewhat balanced individual seems surprises love near love sex graphically detailed teenage trauma abound martin amis also throws rather good observations teenaged angst overall style one intellectual arrogance shame reallybottom line story brat limitless ego libido good reading material really',
 'things matter charles krauthammer one favorite commentators found book interesting enjoyable read admire courage understanding issues day',
 'fantastic easy read fantastic easily read tale scientists journey research history human genetics read dr sykes later book genetic origins british isles ireland book prequel good introduction dna analysis research process development',
 'excellent good full insights examples really liked duhigg explained specifically used insight gained researching book help getting done im giving five stars thats perfection think theres lots potentially life changing lessons well worth reading applying peoples life',
 'double exposure life pictures someone threatened kill last night man following opening words set tone love inspired suspense novel jennie buchanan photographer loner life work threatened isnt sure gallery hosts charity exhibit vandalized owner hires local firm protection head serves jennies main bodyguard turns old boyfriend ethan justice dropped years agojennies painful past makes difficult trust others ethan still seems interested doubts could make relationship work life dedicated photography helping orphans mexico story also addresses issues rejection fear loss forgiveness think everyone relate read realistic fictional characters experience often gain insights lives change better book brings message hopea bonus double exposure set portland oregon enjoy settings home state thanks susan sleeman justice agency installment look forward',
 'ngela selene review dublin street braden serial dater jocelyn antirelationship girl see book tells strong willed braden get jocelyn heartthis best book series book similar bared fifty shades grey book tell jocelyn fall love alpha male like braden carmichael braden try hard get jocelyn love writing pretty good addictive keep reading reading find ending sweet love jocelyn bradenhonestly like story plot book really good really enjoy romance alpha male type book yourating ',
 'unusual beginning received copy author due weather delaying mail got two days ago read many contempory novels loved first historical know must pick give spoilers let say story right beginning caught attention held h h felt real secondary characters well developed felt like routing hea way author delivers please note list kindle edition received paperback copy',
 'delightful thoroughly enjoyed unique takeoff sherlock holmes series looking forward next installment',
 'riveting according brief bio back book author blunt written scripts notable shows law order clear brings lot table fine sense narrative timing strong skill characterization crisp writing style gifts come play forty words sorrow outset wretched cold integral plot characters blunt succeeds evoking climate mythical northern ontario town local police force personalities clash collide ultimately pull together solve mystery murdered number missing kids finely wraught characterizations stock character sight issue mental illness addressed sensitivity insight point narrative focus shifts villain might interior rationales behind crimes momentum moves high gear stays tension spread across several lines detective cardinals anxiety secrets personal sorrows heightened fears sorrows related victims ongoing investigation cardinals past new partner lise delorme torn conflicting emotions relentless ambition cantputdown book lean taut ill eagerly waiting next serieshighly recommended',
 'five stars old theme',
 'captivated eternal captive simply could put book love laura wright told love story lucian bronwyn',
 'highly enjoyed supernatural thriller novel constance green takes larger role something found enjoyable another character seen far long secondary character evolving tibet hidden monastery pendergast studied find constance pendergast something stolen pendergast tasked finding journey takes around world constance join forces find thief large luxury liner heading americathe supernatural human blend seamlessly book grab attention captivate would confronted greatest fear people cruise ship find ancient evil let loose pendergast constance stop one ends wanting toevery time think found new favorite pendergast book another one takes place exception characters motley crew characters find cruise ship brilliant writing descriptive almost mesmerizing',
 'great book good book feel good message movie actually bit better book things made great read',
 'faerie never looked scary karen moning taking us new direction book faery seems story line wonderful keeps reading waiting love new series highly recommend new twist fantasy tale',
 'five stars nora roberts always gives good read',
 'empty nothing author',
 'thanks love make product everything always comes time everything also good condition thanks package',
 'six frigates little known history wonderfully written book read anyone interest political military history young republic united state book covers navigated precarious early existence european powers well competing political philosophies dominated early think interesting issue building navy center politicking led early demise adams political voice end along others vindicated personalities adams washington hamilton burr come life well early commanders builders six ships ships able stand megapowers time amazing role ships played early survival crucial every american history buff sholud read bookwell written informative highly entertaining',
 'five stars terrific plot lot action twists sets cussler apart writers',
 'great beginning readers bought two two great granddaughters parents liked',
 'read readers amazing account reading ive heard corrigans reviews many times nprs fresh air impressed sensible takes literature someone like pile halfdozen books bedside tome welcome visitori dont believe ms corrigans book bestseller seems thoughtful exploration relationship pathological reader many loves would buildin interest amazons book shoppers nothing else invaluable giving one new ideas authors books exploreoh please note book really memoir meditation one readers lifelong list books books influenced sense self years',
 'mcd words like facts facts backed dealing subjects deal survival countrys economic status strength especially americas tom f book offered nonethe facts provide referenced dont know pulled fromi like books offer real situations situations similar us experienced outsourcing tom seems spoken wrong people sourcing  fail rate outsourced applicable timed correctly production process global time wise real life experiencethere also question creditability tom f bias towards import carsits reputed owns toyota lexus suv suv gets approximately mpg yet assaults american manufacturer producing gas guzzlers hypocritical writings may make portly girl friend micheline maynard happy nothing crediablityi like books enlightened entertained educate generate good conversationthis book accomplished',
 ' year old review book awesome squanto taught jack annie plant corn really cool taught lot pilgrims lived',
 'mike really good book informative gives head start plan glad read book would recommend anyone looking kind guidance become financially free',
 'better liked book better lot nice see good story like characters location keep great writing',
 'excellent story great comfort food book western highest caliber book one hates read fast',
 'condition books books worse shape advertised ads booksi know future expect',
 'samantha read almost everything written samantha young love since dublin road first sy book weak spot series never forget first thought loved braden jocelyn think might enjoyed jo johanna cams story little better guy oh cam loved dublin road absolutely love book definitely read book recommend friend check sys books great variety uniqueness appealing slumber favorite',
 'running steam first id give revelation space chasm city  stars masterpieces book retains brilliant energy gnarly braininess feels padded toward end ideas plot good could  pages hacked far better book still good read think ill lay reynolds awhilei feel little betrayed',
 'four stars wonderful book',
 'found thoroughly enjoyed straightforward style since tom clancy passed away wasnt familiar writer authored jack ryan novel found thoroughly enjoyed straightforward style theres probably better way say characters remained character consistent previous books series dialogue consistent versions seemed like listening tom clancys jack ryan looking forward next one',
 'ok book ok wasnt exciting mystery part weak considering alot suspects also didnt find missing deed aspect interesting coincidence daughter bright ideasalso alisons attitude especially police detective annoying like elderly reporters character would nice see take part investigation maybe come suspects motives',
 'five stars good',
 'four stars good',
 'great series love series progressed cant wait next book otherworldly series links well reality giving believable progression events concerning coming fairy tale creatures culture characters grow book book feel know jane yellowrock wonderful lead character',
 'inspiring enlightening interesting started crossfit recently wanted read new obsession book well written delves history crossfit well amazing humble athletes populate boxes gyms one favorite chapters company makes lot equipment crossfitters use rogue fitness business philosophy template american businesses craftmanship great customer service treating employees well fundamental ethic book people work hard lot heart opinion improve human race every wod work day perform',
 'five stars awesome series great author keep coming',
 'fairstein awesome terrific service product love good read one disappoint eager another adventure author',
 'old edition latest published  review quatity book far find full excellent advise informationthis old edition shocked discover third edition book store last night latest edition contains updated information current treatments latest financial information nursing home legislationso buy book get third edition paperback different cover',
 'ode national parks terry tempest williams environmentalist author poet tours number national parks shares deep spiritual experiences love nature outrage destruction treasures even history need understand writing often poetic stirring occasionally format loses reader never dull one comes away deeper appreciation american treasures',
 'sometimes forget loved sometimes get comfort zone appreciate novel always terry mcmillan fan',
 'harlan coben think figured author harlan coben switches things throws big surprise fool another long line books author great twists well developed characters maya exception suffering effects ptsd due particularly horrific mission special ops pilot military faces murder husband sees nanny cam maya installed watch two year old daughter throws whole life question maya frantically tries put pieces together zeal truth undoingfool great ending one definitely expect read harlan cobens books figured ending twice mysterythrillers favorite usually surprised high praise author highly recommend fool counting days harlan coben surprises',
 'coben another good story coben one interesting keeps attention',
 'amazing book outstanding book lets know books read serious intense maybe well one takes cake interested history one forgets parts history want forget book tells us never forget drags little beginning dont let stop gets better much better highly recommend book',
 'lover much enjoyed reading short book became interested reading actual book viewing movie',
 'gold standard urban fantasy ive paperback forever wanted kindle good format also liked commentary movie could beenbut think im glad didnt make thats thing booksalmost nothing touch movie make head theres room secondary characters full lives ownthis faerie donebeautiful dangerous mortal equal',
 'five stars love denver cherry book',
 'five stars great original jack reacher story got started reading rather waiting movie',
 'simplistic needs character development complexity moments entertainment part skim story get essence',
 'awesome novel sanford shines john sanford cant put novel love characters uses lucas davenport family detectives virgil flowers shrake del etc suspense always characters human story line super hope lucas resigning atthe end book doesnt mean sanford going write anymore would never turn book',
 'exceptional psychological thriller police procedural first lisa gardner book somehow seem started th book eightbook series matterit excellent read standalone enough reviews dont need reiterate others said let add  author brilliant job taking worst horrifying kind subject matter relating story readable way lingering horror gratuitous wallowing nihilistic depravity sign true artist suspense ratcheted  without making reader live horrifying violence minimal appropriate torture finest sort psychological thriller bravo also awesome tale redemption',
 'ok first think itll boring religion see baseball school neo come story good summer read chair thats presentation chapter  got',
 'entertaining fun twisted story mixes outwardly mundane actually quite dramatic lives three women families easy fun read keeps guessing def recommending girlfriends little dark little sexy middle young readers warned',
 'great great book hitchens uses obscure pretentious references ruin good points tries make',
 'page turner real criminal th story stayed hidden end never suspected villain dont miss',
 'slow steady wins race dying read book mainly buzz keeps popping newsfeed finishing xmas shopping decide treat myselfthis slow starter wondered made mistake buying hype knew fully invested eagerly turning page page amazing storyi love attention detail back story laid foundation booknot feel rushed forcedthis deffinetly book needs shelf tree',
 'enjoyable well thought reviewers didnt get dont understand reviewers wrote betrayer end certainly featured throughout book bounded world integral source clearly explained none hinted previous books neither plot contradict themi much enjoyed spending time gereck discovered missed loving caring thoughtful karon hard time reading portions felt added story rather detractedi almost didnt read book reviews im glad',
 'lifes work well written superbly researched book tho well done times glaringly transparent depiction unparalleled tragedy recent history sad sick hollow thanks taking time share story us may help prevent future insanity',
 'great read excellent theory wild theory applicable real life book gets little congested times message always theregreat read excellent theory',
 'reacher always follow reacher travels didnt disappoint didnt quite excitement previous books series could reacher getting old',
 'john corey demilles best character date entertaining times literally laugh loud amusement john corey demilles best character date books series except panther exceptional',
 'five stars kept interested',
 'bentleys worst ever house unspeakably stupid difficult find ways describe pacing plot characters ending goresex contenttheyre extremely flimsy best  year old daughter written gripping stories house even worse walking must taken serious effort able accomplishthis convoluted confusing mess ever got published beyond youre looking scrap paper burn fireplace house definitely book buy',
 'really enjoyed book boy family crisis really enjoyed book boy family crisis realistic writing healing therapeutic also story within story keeps reading disappointed came end wanted continue',
 'eventful fastpaced favorite quoteif beatrices eyes narrowed wouldnt able seemy reviewholden fastpaced action packed eventful storyline featuring multilayered cleverly twisted multimillion dollar scheme involving skeezy fertility clinic henchmen local lawmen fbi senator gold digger trophy wife probable police corruption kidnappings murders blackmail ransom explosions several shootouts rekindled romance occurring amidst explosions',
 'boring preconceived notions book sale local c store ive enjoyed books peter straub past picked id read reviews even better ones id probably passed bookthe problem end result already known reading  pages mystery basically knew group people seance kind evil member group killed  years later rest book devoted main character seance interviewing book endsboring',
 'nothing like thought book  good one better thats hope seriesthat one gets better writing tight unnecessary subplots boring details characters several story lines going contributes part overall main plot especially enjoyed authors detailed description every character right type glasses nose way picture character becomes person know therefore become involved writers becomes boring qualities become overly dramatic pratt get drawn deeper book also enjoyed overview justice system judges attorneys court staff work within everything described everyone part book  authentic almost verbatim experienced working judges chambers layman true scary depiction system ending came together beautifully looking forward reading series something never done author always found series books tired repetitive plain boring far nothing like seems part scott pratts writing',
 'must read jane yellowrock fans compilation short stories long ones great fillin times reference havent able figure reference hit lightning',
 'great read love rough riders blacktop cowboy series muchi gave  stars didnt know expect started would well written good quality storyit avery ronin great pair glad get pov endit really needed think avery little naive comes frustrated times didnt realise book ends cliff hangerbut happy move straight next installment',
 'thick pages intricate fun designs going give christmas gift love much treat destressing christmas lol',
 'awesome price awesome arrived time priced right',
 'fun book nice dark ironic touches good story nice twists well developed characters good fast read',
 'art children interesting makes curiosity great way entertain teach art children arent even aware learning something use',
 'magnificent farrell kind comic tolstoy except vastly underappreciated novel alllaughs history gallantry foolishness great atmospherics course luscious prose like vanity fair except thackeray joked among leading war novelists whereas mantle fits nicely farrell get cant go wrong',
 'love book  year son old laughing love book',
 'wonderful story wish friends today like wonderful story wish friends today like',
 'mishimas unseen solitude written  mishimas fourth novel one stands famous works number waysmishima mostly known character introspection dark mysterious protagonists whose psyches peeled away chapter chapter sound waves isnt like like many mishima novels ive readits simple love story kind reminiscent one told spring snow although given minimal brush stroke writer storys setting takes place small secluded island mainland japanthough would easy label typical japanese romancedrama really shows writer coming shell holds beautiful passages would attribute mishima writes way personal guess im little biased great piece worklike tide pulling sand beach mishimas place literature becomes apparent novel progress',
 'five stars uses many f h words doesnt thing story line',
 'four stars enjoyable read characters depth purpose',
 'lots great vonnegut forgotten much liked vonnegut havent read dont care read reread one novels book get straight dope mans mouth many interviews found bumping around section section haphazardly realizing finished scattered',
 'great read plot takes place beautiful loon lake wisconsin characters involve woman sheriff retired dentist many characters seem like real people could figure plot end cant wait read rest series would love visit wonderful place please read book disappointed',
 'another gem max allan collins suspend disbelief literary devices max allan collins uses disaster series nate heller novels thats something ive done easily think youre treatintelligent witty suspensefulloved',
 'good book exploring aspects american history good book exploring aspects american history recently coming light full detail provide good explanation us tough times keeping union together',
 'spice meat book neither turtledovestyle alternativehistorynovel virtual history speculation niall ferguson collection afterdinner mindgames pieces american history short speculative tale added spice style journalists report historian novelist makes easy reading perhaps suitable transatlantic flight something spend much time onwhat really lacking uniform approach historical sitations subsequent speculation reported sometimes inconsequential progress history turning points change everything reader wont learn end bill gates always end owning computer industry spice meat',
 'beautiful thoughtful profound book touches deeply lovingly many todays spiritual issues bought copies book people life care',
 'review night school classic jack reacher conspiracies terrorism investigations fbi cia nsa military intelligence interwoven gripping tale slow start leads unexpected ending read book one sleepless night couldnt stop reading',
 'surprisingly good read written well keep reader glued story little history found story interesting',
 'second book bought astronomy love first book basic one great pictures good even teenagers one like text would use class wanted interesting stuff happy searched around decided books get',
 'great series really enjoyed book good teen fiction need buy book i wish books cover short enough',
 'historys blinders readers reading readers comments appears ones guilty bias slanted criticism yes difficult read book start presumption christian warriors chosen ones however want read actual vs christian history book illuminating details general analysis cause effectone cogent points book brought early entire crusades image christian warrior propagandainspired selfaggrandizing falsity based lies rewriting history sounds familiar modern readers see continuing evidence daily newsanother point armstrong emphasizes correctly essential islam earlier historical phase live let live philosophy internecine conflicts continually changed aspect continues within christianity individual sects cults secular profiteers always arise within large scale religionsthis book required reading foster discussion comprehension particularly among dealing foreign policy cultural history dont understand armstrong saying interpreting wearing historical blinders',
 'truly unique book thoroughly enjoyed reading mr penumbras hour bookstore story good progressed proper pace filled interesting characters book also features fascinating conundrum characters solve intriguing behindthescenes look googles corporate culture read lot please understand gravity next comment book originalnot like anything ive read youre looking something wonderful different give mr penumbra try fun try guess first name revealed story',
 'breathtakingly delicious cry write say book hurts beautifully written terribly sad worth',
 'karen morning writes phenomenal story iced one best stories year dani omalley  years old forced grow much faster dani abused child actions reflect abuse especially dual personality protectiveness towards weak book dani men life although love christian dani someone like dancer respects wont chain cage cage regardless whether white mansion dungeon im excited learn ms moning takes series',
 'interesting bad ptsd interesting premise make feel better substituting good thoughts bad memories idea works youre treatment ptsd emdr rapid eye movement therapy licensed therapist ive personally done several sessions emdr old childhood stuff found invaluable ptsdit developed treat vietnam vets ptsd workshowever without emdr exercises book made feel bad panicky suspect would work way make brainbody connection work suggestions breathe variety breathing helped ptsd one would actually felt worse trying several exercises ptsd even mildly would avoid',
 'lots comfort food old lots basic recipes nonsense frills yuppie format grandmothers recipe book lots recipes comfort food',
 'best book read dan browns books best book okay old old',
 'amazing book ive long ann rule fan true crime section bookshelf focuses mothers typically single divorced separated mothers kill children personally think one chilling level antipathy dianne shows whole book astounding fact thinks shes acting completely normal saying needs go back work cant miss day two surviving children struggling hospital thinks nothing like behavior completely normal eyes still cant imagine mother lost one child two critical condition telling police wont help find killer theyre aholes themselvesms rule chilling job portraying dianne downs ill admit theres sections drag bit longer necessary typical ann rule books namely delving path killer personally dont care daddies people probably wont complain much amazing book',
 'rich facts evenhanded even us studied crusades learn much book armstrong digs deep events crusading era providing freshly perceived context military religious ventures learning impressive objectivity less armstrong condemns religiously motivated aggression western european christians passes much lightly earlier behavior islamic conquerors also driven religious zeal one point writes obvious muslim ideal holy war different crusade essentially defensive whereas crusaders like jewish holy warriors made holy initiative attacked enemies god chosen people yet earlier book written duty muslim state house islam conquer rest nonmuslim world house war world could reflect divine unity morally preferable crusading theory crushed islamic expansionists seventh eighth centuries seem forgotten ask iranians feel muslim conquest persia memory hardly golden',
 'must reading every eighth grader disapointed book everything mr desoto says true course important understood many including many west interested passionately involved eager read instance real impact liberation theology land redistribution protective limits true title transfer options living standards poor unfortunately people need read arent interested learning importance property rights rule law inconvenient worldview interested almost certainly thought read enough subject find mr desotos book basic tediously repetitive bad',
 'great read appreciated factual nature book though tere survived horrible tragedy youth never gave woman admired courage strength',
 'good read feminist perspective art history going back forth two periods two skilled artisans art history feminist reconciliationbut flat love story',
 'excellent book whether youre writing fiction nonfiction book author knows teach read book follow instructions become better writer better reader learn narration one favorite sections identify write grow writing relevant details much prose come appreciate writing authors even get book wont regret',
 'crazy chronology enjoy anne tylers writing characters great wasnt flashback chronology one',
 'heart im going rehash everyone else said disagree think get point really want say disappointed great series gone downhill much last book worst felt rushed like writer ready done tie call night wish hadnt preordered likely wouldnt purchased reading reviews wasted money',
 'recommend good read great fifty shades liked lot highly recommend liked fifty shades',
 'five stars beautiful journal helpful prompts reluctant journal writer',
 'another backman winner love backmans style writing reading beartown many times stop reading reflect characters come alive though know feel theyre going story lasting affect hope becomes movie',
 'another great addition hope come another great addition hope come',
 'five stars best selfhelp book',
 'awesome awesomeeighteen words required guess earth eighteen words',
 'exceptional read plot drew immediately held attention last page could put book read beginning end one day',
 'five stars one best handled soldiers tragedy humane way loved',
 'hooked read one books hooked reset set wish known five buying next',
 'five stars really enjoyed characters',
 'daniel silva seemed hurry end problem book way two half stars fifteen degree turn near end book seemed push rest story way end book quicklyand gone poof',
 'better book  glad didnt give series book  moves quick see turns twists coming quick read makes want get right mockingjay book',
 'engrossing scary favorite jack kilborn book loved characters story kept guessing wanting know like really intense story dont mind violencethis couldnt read fast enough head get trapped next kilborn book soon afterward amazing book',
 'many brands thank exciting story garrett chelsea littlest cowboy heartwarming also tense frightening kidnapping chelsea whole brand family came rescue thoroughly enjoyed reading vidalia five daughters really enjoyed story caleb maya whole brand family pulled together look forward reading work',
 'greate book life incredible woman told humor book drama also gives one good laughs',
 'admirable life relatable intelligent interesting follow olivers life humble loving humans whatever condition present always obvserving everything man science man humanities good read',
 'lacked drama fireworks read book day half ok read missing something read ms schusters books looking forward reading one one pretty mediocre wish would concentrated jason wainwrights lack trust came women many references really explored also book lacked fireworks passion two leap pages like previous books book working man chain could put expecting thing book like authors books continue read one wasnt really great nevertheless gave  stars',
 'great review modern conspiracy literature book excellent survey conspiracy literature central thesis book s s crosspollination ideas right wing extremist groups secular christian varieties ufologists traces history main themes current conspiracy theories eg world run secret cabal variously trilateral commission council foreign relations bilderburgers masons jesuits jewish bankers etcthe negative reviews amazon obviously conspiracists spend nights art bell george noory certain amount paranoid doesnt mean arent get stuff pretty far fetched barkun says evidence mostly sources crossciting otherthere two problems barkuns book  blind conspiracy theories adopted rightwing new age movements vestiges hard leftwing dont know political leanings academics socialists greens may explain omission  subtle ridicule rightwing apocalypses finishes book overthetop apocalyptic warning conspiracy theories passed popular culture makes likely lead violence social upheaval',
 'entertaining definitely looking stories author felt like picked book enthralled',
 'read third time book read third time book passed story page comfort object',
 'wonderful book sparkled passion life characters drawn delicate sensitive hand weaving music literature human stories masterful couldnt put',
 'book sucked would ask everyone pick old spenser book written robert parker read spenser book written atkins realize miss atkins half writer robert parker iwas since reading robert parker books since s told spenser korean war vet true spenser hawk s years go find hard believe hawk spenser punching people one aspect spensers life tv better since robert urich spenser vietnam vet',
 'wonderful read whole mitford series written books continue stay connected peole',
 'excellent charles krauthammer speaks truth love booki loyal follower charles country needs people like',
 'sorrowful sure like many readers age  remember america much relaxed cordial second gilded age yet started book laments loss republic united states beginning empire berman rather grim hypothesis convincing us kept events last half century fear founders feared gentility true liberty social responsibility among much would replaced materialism sacrifice soul country came away sobered sorrowful happy childless tone little sensationalistic accusatory yet place todays america national outrage even much awareness incessant shopping even problem although rich realize happy people country lost souleven skeptical persuaded materialism cost us spiritual lives resisted dismal tone lack hope yet could put',
 'five stars yet another nr book fell love characters real lovable',
 'great book love book great hand help remind things otherwise might slip mind',
 'another great read enjoyed book much recommend adults linda howard one author always enjoy grab copy get started',
 'great series great series getting ready read next book feels much like  shades oh well still like',
 'learning history fun way said fail learn history doomed repeat example viet nam teach us play hide seek enemy hides among local populous casca eternal mercenary depth study military history presented way enjoy even arent egg head thing sadler wrong turned jesus prince peace vengeful character first important story great shame author died speculate one roman dog face would punished severely',
 'five stars advertised quickly shipped',
 'especially view super storms weve last couple decades author tackled huge topic managed bring personal level extremely interesting especially view super storms weve last couple decades',
 'good words live thanks writing book mr pressfield im sure think ideas thoughts book years come',
 'best book lacks humor many books began reading bridgerton series first since decided read julia quinns books however first one really disappoint couldnt see chemistry like characters',
 'good reading lighthearted book easy read',
 'blood transfusion needed asap ok book fun extremely short story line simplistic short read cant help wonder charlaine harris running steam always found sookie books way better hbo series true blood something dead reckoning makes feel though ms harris pandering hbo fans obsession bill eric much bill also miss jason friends miss tara old black magic simply missing book yet ill probably buy next book series hoping next book worth wait',
 'amazing epic tome must must read anyone sense humor part three volume fictitious almanac laughed cried',
 'awesome love way karen white writes every one books isnt one wouldnt give  stars',
 'five stars great great story',
 'good book fan north carolina college sports book read great book arrived perfect condition book gives lot insight coaching recruiting conduct college sports',
 'spurrier best smooth transaction go gators spurrier best smooth transactionhighly recommend seller',
 'four stars thank',
 'brilliant useful insightful helpful especially anglosaxon countries foreigner got interesting culturespecific insights would recommend everyone',
 'together im amazed book one author write beautifully add dirty sexy love scenes best author loved seeing past characters reliving bits stories one best series hot passionate fantastically write must read like others',
 'good looks british accent enough misha hurt terrible things talib accused would editor allow author misha making oral sex early book female lead characters always giving early ms washingtons portratyal misha weak trivial certainly smartthis book lacks good plot reason misha talib feuding serious enough riting weak romance plot story would hot helldont waste time money kimani really hire editor sigh waste',
 'brave little girl wellbalanced adult brave little girl wellbalanced adult turned considering hell grew mind boggling people live way another mental illness stigma enduring toughlove parents admirable hope finds peace',
 'hurled book wall started reflect darn hard write screwball comedy writer likes put heroines predicaments either embarass humilation show plain bad judgement time time character makes bad choices lead mortifying results got tired quit rooting win screwball comedy fragile delicate procelain teacup book hammers w mondosized beer mug instead skip sad puppy give aisling gray series peek instead',
 'good quantum read fine book beginners interested quantum physics laden heavy math comments little dated others show amazing prescience latest developments modern physics',
 'cute book gets reading homeschool  year old daughter earth day read book watched movie lorax cute book gets reading teaches little ways help earth',
 'relationship tensions suspenseful mystery great heroine grace cries uncle suspense novel sixth book cozy mystery series dont need read previous novels understand one story didnt spoil previous mysteries series always suspense end heroine focused solving murder mystery novel grace focused uncovering series mysterious happenings include murder making sure people cares didnt end dead whatever going onthe suspense partly relationship tensionsis bennett relative sister also physical danger mysterious happenings clues going whodunit scenario came using clues turn correct one author wondering right final scenes though whodunit solved whodunit confronting grace case knowledge _did_ something whodunit wanted grace wasnt stupid proof guilt acquired process goodthere sex minor bad language case use god exclamation overall id highly recommend suspenseful novel',
 'five stars love jack mcdevitts novels',
 'handy guide managers leaders latest edition rules tools useful reference many daytoday issues challenge managers leaders chapters how best incorporate new electronics management the motivational problems posed notforprofit organizations boards welcome addition checklists references give guide added utility',
 'fascinating era flamboyance greed fascinating era flamboyance greed',
 'dietz good even bad days readable somewhat predictable still dietz good even bad days ive read books three books best helps pass time',
 'anticatholic hysteria becker following footsteps dan brown authors genre inserting antagonism catholic church catholic offended book wont reading beckers books bad produce riveting scenes escapades characters',
 'newberry valid award two things possessed read book first seen library shelf someodd years ago wanted read second going newberry award winners always good booksthe whole book took  hours read honestly say written adults children characterizations good easy enough children understand deep enough adults understand also telling could see boy eyes others without read book long involved brothers karmazovthe author also knew fair bit detail trucking something done constructed plausible plot trucking breaks families workingclass familythis great book young people dont regret buying brief bit time spent reading',
 'great really good book read book report thought would really bad turned great typos great book especially want learn great depression',
 'read silver linings playbook wonderful read many things going many people human waiting release dvd want compare amazing novel',
 'great book provided enjoyable reading great authorall griffins books proven well worth purchasing keeping shelves',
 'five stars best booksfaye stephens',
 'amazing best one please keep writing continue characters lives want know happens next',
 'insightful somewhat conflicting  stars indicate worthwhile book read something jacks philosophy management really struck contrary management manage intimidation fear ceo requires cut  staff every yeareveryone ever involved corporate politics know conflicting objective highly unlikely impossible obtain wouldnt fear management feel initmidated know mangement must cut replace  staff every year style management using intimidation fear bring results people pure simple jack welch deny disingenuous kidding something preaches one never',
 'four stars advertised',
 'soso sequel sequel jacqueline daughter eglantine enthralling predecessor countess still interesting tale bit dragged',
 'five stars good read',
 'excellent awareness read good reading lots useful information disaster survival prepared different situations gives good advice awareness give  stars well written informational',
 'another good resource another good resource like interfaith minister alternativecomparative religions need library texts resources joseph campbell compelling author',
 'another great book sanford didnt think would enjoy fn flowers much davenport character new book series makes like moreanother great job',
 'three stars shorter expected uplifting nonetheless',
 'boring spots slow moving boring predictable',
 'loose ends tied created two graves offers insight closure pendergasts private demons creating even always enjoyable read',
 'great gift bought stocking stuffer grandparent everyone thought hilarious cover whole book funny well done',
 'sydney brides book  instant attraction leads soul searching third book kandy shepherds sydney brides serieswhen party planner eliza dunne first meets jake marlowe friends wedding cant deny draw feels towards hes available simply enjoy chatting getting know eveningonce jake free finally opportunity give attraction time together magical neither wants anything serious part ways run eliza shocks jake news shes pregnant wont take answer childs life hard headed behavior causes tension themwill jake figure get eliza together raise child strangers',
 'shows promise interesting series pandora athens well done reader may lost parts basic plot beautifulit illustrates average life thirteen year old girl living greece  bc pandora lives era women rights respected resents also resents fact upon fourteenth birthday marry man twice age cousin quite lothesome himselfa good addition collection historical fictionit shows promise onetowatch kind series',
 'fictional search arthur well written nod welsh arthurian history mythology folklore poetry archeology scholarship master storyteller includes enough action',
 'well met well read well ended treasured enjoyed everyone one brother cadfaels adventures honestly say perfect ending series war stephen maud exhausted people served backdrop much series becomes almost character rapidly shifting tides political instability take toll relationships characters forcing grow fast others ponder place midst chaos like murder takes second place indeed almost solves personal struggles cadfael son fitting end cadfael uses wit keen mind bring mercy grace needed help need regardless allegiance save day much humanly possible upholding much vows doesindeed keeps far rule breaksis testament great character shows much missed',
 'wow thought book beautifully written reminiscent blame huneven last year story prisoner road travelled writing complex fits subject sure know story going twists turns keep drawing whether like noa hate pity find arrogant questions family damage done early make wonder death penalty character story would make great book club book',
 'golden age scifi pulp gulliver foyle third class machinist dead end job slow witted brutish specializes solving problems fists coasts life little possible changes finds marooned aboard ruined ship abandoned passing freighter transformed single minded force nature gully uses long dormant animal cunning insatiable thirst vengeance avenge upon left one part count monte cristo one part golden age scifi stars destination introduces concept human augmentation teleportation unique wayhe bends around bidding whatever means available whatever method possible unlikable protagonist cuts swathe chaos destruction would oppose himthe characters particularly well developedmore traits foibles serve little purpose advance plot could well said gully foyles personality plot quest revenge central narrative plot female characters well developed serve little purpose dramatic tensionoverall good story bad person enjoyable read good storyjust thing summer book',
 'beautiful one favorite conroy novels language beautiful masterful storyteller',
 'abby participates dysfunctional family games abigail timberlake owner den antiquity store robbed contents timely offer tradd handsome son wealthy family includes request presence grandmothers house help advise antiques course family game becomes acquainted tradds siblings none appear approve appearance game long murder abby middle finding member avaricious family responsible book author myers continues develop characters cj abbys neighboring antique store owner famous long pointless stories abbys mother rebounds unsuccessful attempt become nun last book finding holy image abbys store myers creates humorous enjoyable book manages murderer one would least expect',
 'love series love series author combines humour intensity beautifully descriptions bear shifters cubs wonderful',
 'interesting pickle shop well thats different yes recipes like piper seems fairly sure loving aunt uncle new friends murder course almost made without guessing killer',
 'one star read one story',
 'solid airplane read good solid airplane read quite clancys style close',
 'four stars beautiful pictures',
 'one favorites one favorite books kids loved book growing insisted read least week bought friend start tradition',
 'amazing man engineer time men bravery without hesitation advancement technology incredibly fortunate country pride right nothing mind make reality great read',
 'buy shame kerby book terrible many drawings missing lovely details treated half wolf said last bookand good thing would never buy another drawings fact even many unpleasant drawings skulls worms crawling bats black crows totally unpleasant pictures none care even one sort early april fools joke second rate halloween drawings thrown together garner money vacation someone else draw evenings time perhaps beginning art student payment  spend saturday night date mind wander many directions trying explain worthless book inspired steampunkloving teenagers scary movie nightmares ugh wish could give stars',
 'taken deception  warning story resolved know played  end keeps interesting mystery several twists believable characters dialogue story line would given another star like cliffhanger type endingsi may reread book look forward author',
 'best bible best great kindle version wonderful study aid incorporates joyces insights amplified bible',
 'one complaint absolutely love series complaint book preordered months ago  short stories learned synopsis reviews already purchased   ended paying   short story galenjessamys story good informative dont feel worth price',
 'book souls terrfic read sure expect friend told book loved well written kept interested way',
 'love bj daniels love bj daniels drawls end right start might think know dont love style writing cowboys must read like books cant wait next one',
 'unusual excellent memoir reading memoir remind times book fiction many memoirs current times include growing castle author describes excellent detail would hard imagine growing castle full items past public coming toursfiennes shares story brother richards life epileptic brain damage caused fever details great length early research done patients brain damage way parents dealt richards illness caused behavioral problems certainly inspiration alli found story jumped around times left wondering missed something times hold attention glad however continued read book finished unique book',
 'kiss dead review always like read laurell hamilton novels short stories good characterization plenty action wellconstructed alternate earth supernatural beings',
 'try another edition want make sense beautiful story deserves better treatment got least kindle edition full garbled spellings broken sentence spacing many editing problems spell check gone mad either looks sometimes editor let cat type words page cannot find better kindle edition pay full price hard copy somewhere',
 'four stars times place',
 'great map paper thin great map printed poor stock mine ripped folds first time opened though bad well torn three places shame print large enough read good product except paper used could find laminated would excellent',
 'dont miss reading book edge seat totally captivated suspense amazing plot took different angles regret reading suggested several friends read series thank leslie giving us series',
 'great book loved story joe keri joes family hoot great developed characters root entire book look forward reading kowalskis',
 'definitely favorite maybe definitely favorite maybe close truth personally know many young adults whose lives destroyed student debt worthless degrees tier schools subject fiction',
 'hotshot terrific action packed story characters played perfection story close heart swimming truly wish playground young lovers love swim',
 'never disapointment always provocative always fantastic readcant buy stories fast enoughvery creative writingthe author found milieu',
 'great coloring book great coloring book like picture one side page incase markers bleed',
 'sweet little novella sweet novella two lonely people lost loved ones brought together nine year old son wants two favourite people get together short read enough character development care ending',
 'five stars great',
 'malibu paradise lost randall provides comprehensive history ringes malibu paradise attempts defend preserve also provides glimpse early history los angeles',
 'absolutely one favorite novels ever adore novel read many times plan reading many come thrilling ride exciting surprising first read hearttugging ultimately satisfying subsequent reads authors prose really strike chord marguerite blakeney woman womans fascinating foibles womans lovable sinsone alltime favorites highly recommended',
 'five stars good read',
 'five stars couldnt put',
 'need read kept reading nfl play lot going great mystery recommend',
 'smart fast paced book well written characters interesting situations part believable loved main character  years old still dealt insecurities thus relatable woman age like vulnerable strong also liked news room setting',
 'required reading zen novice thoroughly enjoyed reading book inspired spiritual journey kapleau truly helps westerners understand satori path enlightenment big questions whatis essence mu nothingnessoneness reach enlightenment faster others willing read ponder accounts expressed book many rewards',
 'five stars great book',
 'sweeping look may going bobbitts encyclopedic review interplay economics military might political structure rare authors pick single lens view interpret world bobbitt remarkable job bringing together key points many perspectives explain unlikely one explanation particular inflection point human history predictions may wind sobering worthy study contemplation consider prevent occurring',
 'quite good remember read entire series high school probably class loved  years later enamored wit probably need maintain memories books loved younger years rather rereading',
 'say yes read past page  get say like beginning understand purpose characters page  thebook starts make sense say middle book im really enjoying really liked end looking forward next two',
 'great book really enjoyed reading midwife venice loved historical aspects book definitely mustread like historical fiction',
 'long time since couldnt put book third book ive read loretta chase really fabulous read night simply know ending loved thus book sexy exciting full fun laughed loud enjoyed page',
 'absolutely beautiful illustrations hardly wait get started absolutely beautiful illustrations hardly wait get started angels one sided paper quality good',
 'j robb rocks jd robb rocks',
 'great book deep data great book really want understand science putting pelz scientist uses research data prove consistent way successfully putt ball hole deep reading worth really want understand things aa short game bible explains process lays data disputing evidence master reality like books ought try one short game schools live really improve book cerebral digested one sitting however read practice drills willing things differently improve',
 'colton fall love colton abbotts another great book incredible marie force',
 'youve loved maya banks books youve loved maya banks books one blow away characters powerful supporting characters add additional life complex story faint heart cant wait read remaining series books like men men women women book disappoint',
 'walk god goldplated shoes walking cane made ivory listen creflo really listen make like walk street suit made diamonds god rewards virtuous jewels jewelry fast cars apartments awesome parties women openminded men lots caviar champagnecreflo courage challenge old wives tale help less fortunate say many words help less fortunate help creflo dollar get another rolls royce god wants three rolls royces hes creflo dollar havent figured maybe rethink going lifecreflo inspiration us want take advantage skills god gave us folks dont think much giving money people obviously enriching backs',
 'five stars exactly wanted',
 'fast paced well written definitely page turner legal thriller witness murdered jake becomes prime suspect hes also suspected killing client friend blinky baroso nephew kip  years old becomes new future character series hes wisecracking delinquent travels jake abandoned silver mine beneath aspen ski slopes much humor along way generated kip along doc charlie riggs retired coroner granny raised jake great common sense novel moving powerful thriller could put drama overtakes reader pursue outcome wow great read author paul levine knows satisfy readers highly recommend',
 'good read  griffins books enjoyed however first one books disappointed ending checked book see pages missing realized really ending never read reviews books see one feel way would recommend new books start first one series already purchased books continue reading end could considered continued might move another author',
 'afghanistan blackpool parallel dramas murderous war afghanistan old peoples home scotland brought together fond love grandson senile grandmother wonderful writing',
 'excellent excellent like crimecourtroom dramas book tee works cant go wrong',
 'new genre literature id never read sethgraham smith familiar writing style found enjoying readat times didnt break hours gaining momentum loved ride ill definitely check books give tryits fun mix history fable think ill hit movie theater see hollywood book justice',
 'freee book free book could better',
 'five stars love cats love stress reliever',
 'wonderful really liked book engaging worried would another dysfunctional family book popular days wasnt read',
 'poetry strong suit like poetry good didnt care realize poetry strong suit like poetry good didnt care',
 'wonderful intervention routines book filled useful information preemergent readers th grade used struggling kindergarten students preemergent intervention routines described use traditional classroom work marvelously small intervention setting book format user friendly highly recommend jan richardsons ideas',
 'five stars great resource office',
 'well done historical fiction one played orphans like shirley temple shirleys life screen never like novel based true story longterm ramifications actions georgia tann woman respected work tennessee childrens home society s  subject told lives family five children taken parents  hard tries eldest rill unable keep siblings together alternating presentday story avery stafford discovers family somehow connected scandal orphan society sets unravel mystery presentation lends suspense author eventually reveals lisa wingate also masterful creating settings describing action sequences characters engaging unexpected surprises along way',
 'hard forget ever florida read book fictional story based historical facts make pop reading got reading many books florida history general',
 'randy wayne white dont miss love anything man writes',
 'food memoir book paean mostly lowland cooking pat conroys reallife experiences dont expect great santini wonderful recipes well written reminiscences',
 'five stars great gift student teachers',
 'good good never says kristi catherines friendship turns guess make woulda nice also keeps talking friend melissa never comes think chapters kristi catherine make melissa comes back tells everything happened overall good book',
 'two stars verbose dark difficult follow took long time wrap story',
 'great chemistry great story sexy get funny cute love kia matt',
 'nice read thriller different angle grabbed first line kept grabbing line lol nice read thriller different angle well done lisa next one',
 'five stars quick reading',
 'five stars great',
 'memorable summer worst bryson far loyal fan bryson international appeal designed local audience page page babe ruths singularly uninteresting private life even lindbergh sequence hardly riveting sounds like editorial assignment something author heart sells think years could',
 'never old live life great mcmillan shows characters expressing important enjoy matter oldkeep living',
 'five stars good reading',
 'riveting keeps riveted well researched interesting see law enforcement worked limited resources',
 'better recent scarpetta novels listened red mist audiobook reluctance early lover cornwells scarpetta series found drowning recent books im whole marino obsessed kay jealous benton storyline wish cornwell would find way help characters move forward emotionally said red mist actually enjoyable readthe plot tight storyline quick fast ending satisfying although little confused kay identified killer maybe missed something cds youre longtime fan cornwell wondering try red mist think youll enjoy lot old storytelling talent shines',
 'editing even finished first chapter many mistakes writers read books obviously dont editedso annoying many mistakes first chapter well first  chapter',
 'four stars along lines gone girl keeps interest',
 'clinical depression experienced author sophies choice lucid descriptive account torturous journey family member friend suffering depression youll want read book better understand theyre going  pages quick read worth time',
 'crime time book direct point law much thepoliticaly correct want twist law edition makes clear law created let lawyers fight courtthat courts yet book tells laws isit damn fault violate law time',
 'awesome series fell love story line much reading one came back amazon bought rest series group former navy seals alpha pack topsecret team wolf shifters psy powers fighting things get nasty one aric finds true love la cop cool works story women way really kick butt said story line sounds familiar like read somewhere give away book part mate die doesnt take away story love special op team overall seem work great together series cant jump pretty much need start beginning read orderreadblack moon alpha pack novelfirst really see story unfolds give really feel people story line',
 'good book gives basics consider retirement planningeasy readget yellow market highlight importantfacts companion book investing also recommended',
 'easy draw great book learning draw people  year old normally would draw stick figure draws like musch older child super easy follow illustrations',
 'poor funny bunch name dropping finish itdont waste time',
 'catching fire shocking way end book never saw ending coming cant wait watch movie compare book',
 'five stars good book cant wait next one',
 'psychic murder mystery unusual twist mystery jeff injured name result sees things happen rebuffed cops tries fill foresees injuries bring back touch rich halfbrother brothers significant',
 'five stars great book jim butcher im looking forward steampunk lovecraftian horrors vibe time',
 'prek students love book read prek class loved responsive enjoyed illustrations words dinosaurs say good night slightly popular however theyve asked read one definitely hit age group love rhyming',
 'great condition cute story great condition cute story one favorites growing',
 'best certainly best',
 'excellent excellent read information ability applied correctly change world must read ages',
 'great book love series gives tru feel venice characters interesting love human side detective family',
 'five stars good book good instructions well illustrated',
 'excellent book first book read excellent book first book read karen white plan read many excellent story well written recommended sister',
 'thought book going nice little light book thought book going nice little light book however turned book dealt cycle abuse moriarty mixes lighthearted serious events beautifully',
 'great step step guide useful book specific steps examples would recommend purchasing book msw students new clinicians',
 'excellent kids kid loves book easy read lots fun visuals',
 'highly recommended biblical treatise serving needy keller pastor manhattans redeemer presbyterian church writes brilliantly biblical call social action love neighbor avoiding extremes social gospel crowd evangelize crowd keller shows scriptures call care poor integral part lives disciples christ keller especially adept making gospel connections strength book keller powerfully explains need salvation met christ compel us care others physical spiritual need apart excellent material herein worth book second last chapter justice public square  pages long read due compact size hardcover edition  x  highly recommended anyone seeking biblicallyinformed view serving need',
 'sent son iccc per selection thinking pocket size able put pocket bit bigger son thrilled four books got work book although get book go workbook silly mothers need read fabulous deliver time thank much dawn norris',
 'another amazing karen white novel karen white master story teller every book writes excellent long time gone wonderful mix past present mystery romance family',
 'ok hoping better ideas regarding best graphics depict information conveyed felt like book different apps available use creating graphics less best types visuals',
 'master good stuff books written  master good stuff books written  years ago retain alluring mystery interesting characters',
 'five stars awesome set great price',
 'reeks agenda havent enough politics personal destruction author thinks news flash everyone faults everyone ideals falls short perfection elect politicians vote according preferences arent models virtue shocking revelation know right loves make michael moore punching bag maybe thats hes fat since engages hyperbole hes easy target politicians parse words little carefully decide going vote issues book like lambasts people one end political spectrum human failings gives pass conservatives help thank schweizer dividing beloved land red blue camps slide civil war beacon freedom snuffed due part hatefilled rhetoric hope true patriots hold accountable end day peter schweizer attacking americans kim jong il kaddafi bin laden sworn enemies beloved homeland attacking stretches limits taste abuses right free speech brothers fought died attacking fellow americans shameful extreme',
 'alvin ho rated book  star short book lot pictures overall  star',
 'panther another great demille book starring inimitable john corey witty irreverant character brilliantly written novels always page turner cant wait next',
 'j robb always hit far one many hits read two days loved story death books great always recommended books dont read nora roberts love done booksshe perfected dallas roark couple always look forward next book',
 'happy find digital version read long time ago friend lent happy find digital version enjoyed',
 'bought gift recipient gave bought gift recipient gave thumbs thought wrong book first comes different covers',
 'first book heard editor told could publish book blank pages would bestseller girl train id say fans would appreciated book unorganized hot mess must seriously distracted wrote',
 'read devoured day devoured daywow say without spoilersif like seriesyou read thisso much happens end great parts good yet others make really wonder lead us magic shiftsexcited release new book',
 'another thriller pitting duke fatherinlaw st cyrus books well written full good balance action dialogue bi romance exception starts couple finding dead man head sitting top bridge shades ichabod crane another victim separated head next mysterious man starts following st cyrs wife baby uh ohthats sure bring pappa run addition bad guyswhoever st cyrs fatherinlaw threatening death anything happens daughter grandson suspects convincing enough one dies start deciding think guilty one keep guessing enjoyable always',
 'survivors club lot info usable interesting read cold winter night didnt find much could use',
 'five stars loved book john corey one favorite characters',
 'must read one books everyone read graphic written make impact sure important see cost total war',
 'simple story catchy think kind simples detective story easy read makes want see disclosure recommend especially stay buses day',
 'another bright star world pnr jeaniene frost ability frost create story paper immediately immersed new world amazing moment met ivy plight emotions wellbeing encompassed good versus evil fight ivy dark secrets past future slowly revealed creating action packed adventure unknownadrian creature sent either rescue kill ivy proves worth saving instant physical connection confusing otherworldly angel demons good bad light dark ivy save sister save betrayal along way sayi love new seriesanother bright star world pnr jeaniene frost starst',
 'five stars great book',
 'enjoyed thoroughly interesting way tell tale twisted enjoyed thoroughly',
 'somewhat return form even though doesnt quite get level classic era clancy quite good recovers elements made clancy great prime especially current trends world politics computer science hacking one best since nadir red rabbit represented imho clancys best classic characters back ie jack ryan sr john clark domingo chavez well younger ranks jack jr dom caruso sam driscoll altogether nice entertaining read even rock world way without remorse debt honor executive orders back day',
 'story encouraging especially orphans people story encouraging especially orphans people grew foster homes brings hope well done teressa',
 'five stars loved',
 'rags riches formula immigrantsis formulaic rags riches novel takes place period late s s go san francisco earthquake shipping opportunities world war beginning airline industry stock market crash dan lavette goes teenager whose immigrant parents killed earthquake multimillionaire homeless stock market crash along way loveless marriage nob hill socialite children doesnt know love affair love child chinese girl spite unique storyline enjoyed characters interesting information life california period time books take place timeframe centered around eastern locations like new york city author appears done lots research bring location life even get glimpse movie industry getting ready talkies emergence california wine industry like book enough consider reading rest books series',
 'digging old older memories book like well told story coffee maybe relate type family always secrets grew near sacandaga reservoir s pretty creeped town water much cant bring research true imagination topics covered story powerful anna quindlen writes moving words touch everyone reach another winner also personally flooded irene',
 'almost read like numa excursion always juan austin zavala think would always make great team good treat briefly appear plot',
 'great read easy read interesting read really like women prison recommend finished  days',
 'shes good generally read hardly ever impressed genre fiction ilk ive heard much ruth rendell thought id give try low behold good top drawer entertainment credible creepy characterization absolutely buying reading prolific author great crime writing',
 'nice book rhyming little bought book take beach vacation  year old wanted get excited beach fun book son really interested ebach exactly youd expect nice way teach beach issue book rhymes kind times read  plus times senteces rhymes dont work consistently kind weird however love book would recommend',
 'five stars great',
 'hit grandson loves able draw pets great idea would recommend ages ',
 'good overall id dying read book excuse pun awhile thoroughly enjoyed true blood tv series basedi wasnt exactly disappointed hardly blown away story started strong introduced sookie bill sookies grandmother sookie saves bill bill saves sookie sookie bill lots sex lots pretty erotic nothing havent read nearly graphic television show thinking shaping really good readthen around  things start take turn worse typos point barely noticeable get numerous sookie hears things either takes completely stride story written firstperson way sookies perspective acts though already knows obviously doesnt explanation given would almost author rushing get story finishedtowards end  onwards things start get better ending quite satisying although fight scene end little unrealistici dont think worth hype definitely picking next series nonetheless',
 'empty creepy story isolating painful like people easy ignore misery mentally ill absolutely turn recognise strange behaviors dont remember incident time happened eerily similar recent las vegas shooting fact never know happens someone decides deliberately plot execute elaborate plan hurt people',
 'waited disappointed read books series seemed like forever till came great series please make last',
 'bed roses bed roses nice continuation series nora roberts series full romance possibilities need',
 'great book easy read brad taylors books fast paced easy reads one developed history times',
 'change mind change world dont review often book made positive impact life throw five stars well reading book never questioned insecure thoughts brain produced thought stuck learning challenge thoughts relieved much anxiety selfdoubt approach dr luciano uses help rewire thought patterns isnt complicated believe everyone reads book open mind willing give honest try able strengthen hisher mental muscles greatly buy bookfor price couple drinks starbucks could well way feeling much better',
 'silly profound rabbi david aaron writes popular books kabbalah endless light deviate path reader looking book academic rigor wrong book rabbi aaron also lapses obvious simple stories illustrate points habit become annoying make reader feel dumbing things far beyond necessary pointrabbi aarons text good points well takes difficult times abstract concepts like tzimtzum contraction created universe kabbalah isaac luria makes apt lovely comparisons romantic love love toward ones children takes lofty ideas concept brings earth ways understandhe also makes strides divorcing concepts entrenched definitions instance god rendered hashem thus able move idea god away old man sky image make divine monist pantheistic idea although kept reasonable check also firmly ties ideas halakah ie jewish ritual practice meaningful kabbalah without torah',
 'well written novel tubby good start tubby series little wacky side hey new orleans',
 'omg awesome ride excellent read ages alert spoilers aheadfor read first two books found brooke bad side turns back end shes sorry team kinda forgave way gives clues felt sorry happy reading',
 'five stars great description machine politics pitfalls',
 'another fantastic jason bourne novel eric van lustbader done marvelous job robert ludlums jason bourne novels almost actually reading robert ludlum',
 'excellent summary science movie interstellar book kip thorne excellent summary physics used write direct film interstellar',
 'dont think would brave enough tell dont think would brave enough tell story goof ups life penny',
 'house hermit crab book great condition arrived quickly mention amazons prices great little girl bought couldnt wait parents read',
 'great read find baldaccis books real page turners king maxwell series exception cant wait read next one',
 'stevensons tushery triumphant robert louis stevensons classic first written serial installments  takes one back th century gallantry embodied hero sir richard shelton stevenson loved characterization richard crookback richard iii whose skeleton recently found england stevenson encapsulates wars roses delightful novel simple enough young understand enchanting enough mature enjoy heartily recommend',
 'three stars really rather boring sorry',
 'good read enjoyed love walter realistic things mentioned work',
 'jackie bites dust pityso sorry author died whole family loves jackie great seriesperhaps tv mini series possible',
 'gift father whos big demille fan said wasnt great havent read yet mother hijacked suitcase leaving',
 'getting better time jack jr good better father keep going protecting home front',
 'five stars got friend',
 'new cast tom clancy story fine missed regular cast characters made difficult really get story',
 'good read good gentle read',
 'good book expected muties redebouts technical side story little id like know howler muties sec droid aircraft carrier maybe author book expainse droid work  yrs also books set  yrs nuclear war arcraft carriers reactor cold ie cappable nuclear reaction half life centuries  century passed skydark think editors need proof read thes books better love series things like make mad come expect level technical know writer book sadly lacking areas end good read like plot twist end great look forward many adventures ryan crew',
 'three stars loved audio dont think would enjoyed reading',
 'five stars ballsy crazy dude',
 'give try great adventure story read norton boy stories hold well son reading',
 'ending book great example howards amazing talent ok would buy expectation enjoyment normally experience books im lh fan saw another book paranormal elements coming excitedhowever actual book falls short pun intended problem book short character plot development severely lacking actively disliked hero  booki could overlooked plotcharacter problems werent ending isnt one book ending well ending major cliffhanger completely unsatifying really hate kind endingsso youre like hate books ending wait whole series comes sit read order',
 'backstage audrey meadows relates coming goings honeymooners production also paints picture new york s almost one seeing sites new york spending saturday evenings backstage honeymooners set great read',
 'silver angel title silver angelauthor johanna lindseygenre historical romancerating  starsthe thing didnt like ending felt rushed returned england one thing would liked known dereks grandfather felt everything happened',
 'great read kept guessing well written great read kept guessing',
 'great conclusion ending great reading continued story northern waste baby sister tatiana grown meets man known tristan searching scientist name tolliver may help find destroy gavin ward man kept imprisoned experimented created deadly virus blood tissues man tristian says knows tolliver follows camp underground unbeknownst people camp infected virus begins story much action stories really like way author uses language forms series bad guys hard find good ones frozen wasteland much enjoyed reading three books must start book one understand characters reappear stories good seeing family get together end one wizard yuriko raina tatiana',
 'bit looong stilla good book get know fabled hari seldon youthyou get see trantor golden albeit jaded yearsyou get know became spacers came trantorwe neveralasknowbutreallythere sections made think asimov paid word absolutely pointless adventure citys roofthe repeated bickering psychohistory feasible good griefas didnt know alsothere moments hari behaving really rude manner reasons lamely justified mycogen chapter one asimovs masterpiecesin criticizes ethnic separatism selfrighteousness wise depict jadedness decaying empire would needed jack vance',
 'dear luke could funnier john moes book dear luke type book humor subjective often reliant reader least base familiarity material spoofed whole badly constructed set correspondences times seems go shock value utter ridiculous rather subtle types collections often find hilarious might reverse really probably going funny people enjoy satire bit geek streak best bits geeky however youre easily offended book cup tealaugh loud funny times humorous definitely youre open want something interconnected wont find unless count super bowl halftime show proposals run throughout book want sometimes irreverent absurd commentary though good place start solid   good never really feeling like type book ever amusing coffee tablebrowsing volume',
 'attilas leadership curriculum coach spend lot time reading leadership material opinion one best little books leadership printed dr roberts condensed pillars good leadership easy read book anyone would like become solid leader industry many books go mind numbing infinite detail every aspect leadership end overwhelm youre looking something either enlighten characteristics found real leaders need quick reminder book',
 'loved read rachel hauck books think like one best good clean fun',
 'five stars scott pratt good author',
 'snoozefest youre interested stereotypical navelgazing psuedoartist protaganist books spoiler alert nothing happensskip read something else',
 'nice well made book though wife likes book prefer angie grace books animals people pictures requires using lot colors nice well made book though',
 'loved hardly wait next book series feel like hanging hope hurries writes',
 'plethora information matt great job giving advice regions globe indepth full websites check things dont want run plane ticket youre done reading would surprised',
 'nd book winner enjoyed one much first issues grand scale luck reread series makes good book great thanks',
 'great book love story characters ilona daniels hooked story hardly wait read next installment',
 'much profanity enjoyed every hugh howey book gave   stars read  pages one many cuss words immediately turned know ability write amazing stories without using much profanity wish could get money back',
 'good suspense well written good suspense',
 'geopolitical primer book big surprise presented interesting format provides insight stirs geopolitical pressures ten books must read list one',
 'great read love series cant wait next oneif enjoy military scifi great read',
 'propagates pernicious misconception henig admittedly takes creative license fill historical gaps goes far propagating misconception mendel sent copy paper charles darwin darwin never read urban legend also brought authors philip kitcher made way newspaper articles even textbooks catalogs darwins library early s later made mention mendels paper instead secondary source focke mentioned mendel darwins library relevant pages uncut see andrew sclaters  article georgia journal science',
 'typical demille thriller person old enough remember evening television broadcasts debacle southeast asia embrace demilles detailed portrait moving gripping frustrating thrilling nelson demille novel thats extremely close reality history',
 'utterly delightful coloring book much fun enjoyed brief descriptions various sea monsters makes nice change flowers mandalas',
 'dky wonderful example ms miller talent great talent story telling makes want part world characters come life books reread many enjoy',
 'another brilliant read nora roberts never fails another brilliant read nora roberts never fails keep reading plus quite new line writing latest books love',
 'five stars product shipped received promised',
 'horrible read better balogh ending drags hannahs sister fairly menial fortunately novel long beginning boom far better ending melodramatic overwrought',
 'great book bought book  month old son loves read countless times loves trains book become favorite cute catchy story enjoy reading',
 'four stars fantastic perspective palliative medicine found little bit short',
 'never endangered pickets nate lives cat fast paced twist unexpected turns would recommend looking forward next book',
 'four stars good book',
 'well done engaging usual ggk good job fictionalizing historical period adding light coat fantasy fantasy becoming center story characters thoroughly developed settings well rendered author good job managing plot lines leaving loose ends usual ggk doesnt go easy happy endings provides satisfying conclusion entry par sarantium series good book well written engaging',
 'tour de force north korea read one book north korea north korea news thought would pick book im glad remarkable book giving insiders view country reads like thriller well researched well written',
 'highly recommended quite enjoyed pacific crucible look forward release conquering tide overall id recommend book interest understanding beginning war pacific also wanting read book moves fast pace minibiographies critical personal sides especially enjoyable',
 'history philosophy rhetoric form autobiography philosophical work native american disguised autobiography ohiyesa begins life story fifteenth year father arrives village take mission school ohiyesa chronicles journey world white man along way learns language submits educational system culminating attainment md boston college agency doctor one first people view scene wounded knee massacre subsequently served missionary ymca throughout history life late th century america offers thoughts disconnect proclaimed christian beliefs whites actual behavior contrasted traditional morality native people indictment devastating fills vital place historical philosophical native rhetorical studies',
 'creepsters truth shocking fiction heres ultimate political thriller true powerful memoir details woodward bernstein young reporters covering bland local issues got opportunity lifetime unhatch story vast political corruption unethical dirty tricks known loosely watergate directly started domino effect eventually toppled president richard nixon got many cronies thrown slammer addition crucial information scandal book also invaluable look process investigative journalism reporters cultivate sources follow obscure leads break huge story note book hard follow places especially regards names many washington conspirators involved issue watergate rather woodward bernsteins writing scandal huge anyones comprehension find continually looking back cast characters authors helpfully placed front book finally book crucial vindication true necessity vigorous free press america kudos woodward bernstein courageous colleagues exposing ultimate arrogance power damage american democratic process doomsdayer',
 'disappointing great fan elizabeth george enjoy inspector lynley series concur marte one book seemed pretentious wonder eg actually wrote persevered book enjoyed books date',
 'remarkably poignant wonderful story sister boy autism conveys struggles joys love honesty simple truth author gift read share enjoy',
 'enjoyable read much ive read enjoyable read much ive read againi read deployment first came still great story remember',
 'great story different levels great story different levels bullying physical violence gay discrimination etc good read',
 'good could great realize pre  book mr childs th reacher book plot subplots story line great would rate star give mr child  star character accuracy understand mr child british chosen american military man main character great however mr childs lack understanding american military system detracts books example book indicated ft dix marine base army speaks housing uses terms like bungalows military men never use would referred family housing officer quarters officers stay barracks therefore mr childs lack proper terminology detracts books probably many military personnel read needs military advisor help along',
 'engrossing tale folly fortitude lost city z engrossing tale folly fortitude great read without deficiencies material rgs seems like padding impedes story also thought subtext book deforestation amazonia emphasized try search places mentioned book google earth see towns cities jungle coincidentally pbs airing ken burns doc theodore roosevelt reading book suffered river doubt much colonel',
 'five stars loved kids school library',
 'five stars fascinating',
 'love book great story book great love story loved thrill definitely one best books along many others',
 'anne rule i killer love anne rule books shes good writer ebook real good i killer killed coast california washington state read real quick good',
 'sandford best golden prey john sandfords twentyseventh lucas davenport novel amazing first novel lucas davenport trying get feel recent appointment us marshalls service spending career member minnesota bureau criminal apprehension lucass appointment met skepticism people office lucas answers supervisor washington lucas searching first case colleague tells robbery involves drug cartel murder small child knows must find culprits backwoods tennessee high desert west texas aid special operations group officers bob rae track suspects nothing lose trying outwit contract killers cartel sandford takes reader heartstopping fastpaced ride best novel date given advanced copy book opinions expressed review',
 'superb literate almost  review book dont know must add comments add must reader nonfiction biography cannot say led kite runner great honor read book read debated digested years great literature outstanding comtemporary fiction read many years  times story kicks reader gut movingi recommend kite runner squeamish sensitive tough violent book read care profanity graphic violence current fiction often much worseamir hussan many characters book stick consciousness long time',
 'thing missing list pantry items love cook work way thru im one thing missing list pantry items love cook work way thru recipes great item new cook wonderful gift give full encouragement',
 'mercy saves everybody unputdownable worthy continuation mercy thompson series cant wait next onethis expensive worth every cent',
 'loved another amazing story felt end nearing slowed reading desperate end story happened yelled nooooo read last line wow loved read',
 'sookie liked installment sookie series love felt little bored spots know pops back things heat certain vamp interest increased general love series sookie four five take look forward next one though',
 'five stars made well content',
 'afterthefact wisdom much like book matter book provides insight much insight begin withelectrification important net would known pointjust matter seriously flawed ignored much important question business matter rewiring book asking questions havent asked alreadyfinally kind wisdom usually received experts financial markets factzero stars give one star zero stars wasnt allowed',
 'four stars really enjoy reading devotional',
 'five stars good narrative journalistic story telling interesting critique consequences onedimensional conservation methods',
 'five stars gift',
 'love book book cape light series thomas kinkade katherine spencer wonderful rest would highly recommend anyone looking feel good story',
 'word usage beautiful yet reader cant wait find middle write review anyway always pd james books pleasure read word usage beautiful yet reader cant wait find happens next',
 'masterful blending history fiction reality rating what beautiful blending history fiction pressfield master makes us remember much owe greek culture special interest elevated role women ancient greek society compared recent european societies machismo rampart jas',
 'book  continues good reading cleverness series book  continues good reading cleverness series really enjoyed  quite different mysteries solve plus inclusion veterans storyline tied one mysteries well bonus adding  nonfiction books book wish list wish town bookstore cafe book clubs terrific setting',
 'nice lord talk daily ordered several given friends terrific way start day',
 'love series enjoying whole series love two main characters crack hope enjoy much',
 'another hit master teacher writer writer one howto scene quite like william zinsser teaching writing books would reading list highly recommend even though mostly words writers zinssers vision made book possible us interested learning write memoir genre',
 'fun book read prince rafe bored spoiled spending time frivolous pursuits changed father needs take vacation throne rafe prove takes king rafe meets daniela boredom seems disappear daniela hero people country robin hood thing brave loyal perfect liked characters book times found shaking head ridiculous things wound rafe behaved like jerk holdingfondling girlfriend lap front council advisors wonder people wouldnt take authority seriously really daniela instantly becoming bonded queen mother five minutes wakes coma understand author going much said enjoy book liked fact rafe daniela united front love fought almost obstacles together definitely read gaelen foleys books',
 'five stars fun reading',
 'great story humor right alley story interesting fast read thoroughly enjoyed every minute didnt want put',
 'motion express boredom sloppyimprecise prose  painful pages get trial thenits hard carewithout depth pacinga chore complete',
 'boys favorite book boys   obsessed things star wars havent seen movies yet full questions characters book perfect introduce characters strengths weaknesses also helpful since memory star wars movies bit hazy ask specific questions rates characters strengths various categories guess think win battle look back find answer shows ratings graph form translates perfectly kindergarteners math lessons read several matchups night never get tired',
 'like large scale used helps considering vision roadmap collector recently bought randmcnally maps  states plus us interstate map like large scale used helps considering vision less ideal also use maps supplement official maps also entire country',
 'five stars good story',
 'sorry couldnt get anything normally read nonficition books many good nonficitoin books learning years ago decided read fiction book recommendation good friend celestine prophecy thoroughly enjoyed heard tenth insight felt good time read second fiction book however truly disillusioned tried twice read book eventually got half way book could understand book tried twice got nothing read average  minutes every morning  am read  nonfiction books one book half fiction',
 'amazing fun silly potty book boys book published time princess potty book amazing amazon bestseller authorillustrator duo pirate potty works much way princess potty book boys book making potty time fun excellent kind weird think need book make potty positive thing book really helps think coolreading book see book real hero pirate means mind reader hero really young boy whos reading book reminds bit captain underpants totally appropriate young kids make comparison protagonist easy identify withanyway great book wont let comes making potty training happen effectively positive vibes',
 'lovely warm exciting time loved book find sitting night reading cant put',
 'doesnt quite add im mediumsized fan oatesamerican appetites bitter heart favorites saga misses boat bit never seems logical even probable father would exile daughter raped though book set s feels like s morality tale takes things fall apart center cannot hold thing far making seemingly ideal family go leave beaver perfection mean wacked folks become due one incidentterrible may kept reading reading left dissapointed end like romance novelist oates sews everything neatly mulvaneys got messed earlier rape probably wont keep stuff together either',
 'great book second reading wonderful book read   read  jios books excellent author recommend highly like lost loves secrets new loves',
 'great book best one three great book best one three',
 'interesting reading sweet story little dog connections cannot see know great read informative',
 'love book yo got book library fell love came find happy came full alphabet lots signs words easy see sweet pictures would recommend even child wasnt interested asl yet',
 'five stars good story reads well light read many real questions inside enjoying series',
 'five stars loved movie bought book different movie loved',
 'five stars good read',
 'epic one favourite books time im huge fan nalini singh series opinion best series date',
 'great great book love reading books old montana old west tell friends read book',
 'wickedly funny razor sharp could love sentences like vicky nothing much write home dim honeyblond creature spectacles goodegg type often found bennington sitting apple tree group similarly undistinguished girls pile knitting need know bennington understand book bennington could fictitious wed get gist story youth friendship riding reputation expectations college cynicismbitchinesswhich delightfulis perfectly balanced naivet tender honesty feel better world knowing charlotte silver best book ive read long time',
 'favorite probably favorite story st patricks day read aloud village crookhaven cursed local witchs horse stolen king kate osullivan father brothers try steal horse back captured kate weave series tales get hook describing true stories family worse spot one king amused enthralled kates tales last one work undone astonishing secret revealedhudson talbotts illustrations riot color action expressions characters evocative laugh loudgrab irish music play background share story everyone story compell read irish brogue even never beforehudson talbott books like storyteller sitting elbow pacing story interplays illustrations perfect',
 'good book good read keeps moving good story line liked alot',
 'love mckenzies jennifer wrong withher mckenzies watching daniel grow books always likable character jennifer continues make strong man uncles moral fibre look forward next family saga',
 'difficult confusing book read confusing book difficult read enjoyable hope improve forced finish lead character confused psycho alcohol problem',
 'four stars good',
 'details details details many thing many descriptive details book far exceeded book took work get dont need entire chapter devoted describing relationship chicken paragraph would sufficedthe protagonists unpredictable outrageous emotiondriven actions story seemed inconsistent taskoriented sophisticated perfect paula impression given book advertisement book expecting based description wordsi didnt unique lasting impressions reading book enjoyable read feel though wasted time read book withstands test time said finish book mostly end curious see plot got wrapped suppose something said',
 'cant wait next book really didnt want book end tried read slowly wait  years find happens love series love characters love author hope changes mind writes  books',
 'nice book warsaw ghetto atrocities nice book warsaw ghetto atrocities hitler read hype jewishness expecting deliver front',
 'part good series book continuation good series good plot excellent descriptive writing part recommended',
 'must read every buddhist confessions buddhist atheist wellknown buddhistauthor stephen batchelor eyeopening mustread book every buddhist contemplating becoming buddhist holiness dalai lama batchelor captures true essence buddhism much writings beliefs already practice india buddha incorporated ideas placate indian religious authorities much anything example batchelor points cycle birth rebirth first proposed buddha already wellestablished indian religions always trouble believing birthrebirth concept nice read highlyesteemed educated buddhist part real buddhism batchelor goes back pali canon oldest record buddhas beliefs teachings get fundamental truths buddhism open new ideas accept buddhist teachings ultimate word even must read book',
 'one dovers best complex patterns smaller color blocks lend deep jewel tones really like perfect prismacolor pencils paper layers blends nicely relaxing',
 'jack reacher wife love reacher books one sent sonas birthday presentcant really comment',
 'wont believe itat first book scientific strategies existence struck book stuff new really new many books make claim end though usually techniques years past used different examplesi purchased many kevin hogans products years andi never let tell buy book moment see something release new book hoganget',
 'missing pages chapter  three half pages blank suppose three questions answers disappointing chance could get copy missing pages',
 'four stars well written interesting format',
 'three stars favorite',
 'young love found interesting book easy read didnt want put started bwould compare twilight series minus vampires',
 'enjoyed novel enjoyed novel discursive lots asides lots lists jonathan safran foer seems love enumerating things ruminations lists contain lot meat novel author illuminates many aspects life era read book many moments thinking yes identified exactly humor personal found lots quite funny wry sort way',
 'joe pickett rides usual c j box come another good joe pickett book read every book series continue read long series continues',
 'always good always great cant go wrong things even husband likes listen road trips',
 'would recommend book highly fan anna quindlen books book resonated setting went thru something similar teens dam built area growing years fact cause able review book fairly think would recommend book highly',
 'want great novel really enjoyed novel easy read would like seen ending little drawn otherwise delightful',
 'five stars totally captivated',
 'gentleman bested three young nieces school teacher book everything love romance great love story lot humor hero trying get  nieces control funny heart soft get done thinks hiring school teacher answer problems soon finds wrong three girls wonderful love',
 'love love stories one best authors',
 'die hard prey fan waited seemed like forever book become available disappointed action crisp plot twists keep reader engaged tense gripe read fast like thanksgiving meal takes far longer preparation partaking sandfords books finished way soon bad dilemma means',
 'pioneer travel writing great fun filled fascinating insights ms birkett walked walk pitcairn emerged notebooks intact enjoy high adventure sense humor treasure book',
 'even better dummies guide brainer guide s color pictures real clear text pithy step clearly explained dummies guides wordy often difficult find want text pictures make easy avoid frustrations actually want bravo simplicity',
 'another reacher classic another jack reacher story intrigue deception jack inimitable way solves problem beating bad guys saving lady highly recommend lee childs books anyone familiar one jack reacher fans',
 'dont turn nose read order want gimmick completely brilliant ordinarily wouldnt go read whatever order want kind set ill damned chapter sequence cortazar recommends doesnt create one wonderful richly delirious novels ive come across moving europe latin america delightfully vaporous intellectual ruminations made footnotes excerpts fake utopian polemics sections modernist experimentation brief impressions parisian street life etc found completely tangled thoughts dramas dreams occasional moments pathos make horacio oliveiras life bouncing back forth chapters beautifully reinforces wonderful streamofconsciousness style cant remember many times much fun slink satirically overintellectualized mind',
 'professor full circle enjoyed favorite got see gabriels julianne story goes get see loose ends tied upstill much love professor must read read first  trilogy',
 'record weather today extremely satisfied bits sheer surprise wow okay mr palahniuk done book charmed way presenteda diary thats done executed well fell love lead character way narrator spoke mistys constant struggle several oh god moments reading bookthats never happened even reading chucks previous work far favorite upchuck book diehard palahniuk fan must read anyone wanting give chuckie whirl take book spin someone looking darkfunhumoroustwistedexciting quick readgo ahead upchuck delivered goods book always book filled classic palahniuk twists turnswhich grown love another thing wonderful book fabulous oh chuck oneliners thank mr p fantastic read salute',
 'epic conclusion trilogy mocking jay epic conclusion hunger games trilogy book katniss everdeen rescued electrifying quarter quell learns district  really exist revolution coming evil capitol panem katniss must become s mocking jay symbol hope rebellion personally really enjoyed book ignoring bad reviews think second best trilogy books book written suzanne collins third last series book gripping epic conclusion highly recommend book hardcore readers fans first two books',
 'interesting slowmoving piper kerman devoted boyfriend promising career arrested  yearold drug trafficking offense product brief postcollege fling world drug smuggling intriguing memoir illustrates prison life minimumsecurity facility course  month sentence interesting often insightful orange covers daily interactions prisoners guards prison food surprising sense community maintained fellow inmates downside however books episodic nature sets slow pace often making chore slog',
 'melody know wrote book could jd robb style path rest books going take wont reading anymore makes sad considering awesome characters books murder mystery one entirely dark like reading killers point view often',
 'entertaining book entertaining best series continue read books janet evanovich',
 'devotional always loved using devotional discovered pagan roots kind lost without devotional one covers hearts desire little book blessing restored belief well universe gives soul needs daily basis thank',
 'five stars love karen kingsbury books',
 'sadly unfinished premise daughter centaurs intriguing execution lacking really really wanted like book especially rarely see centaurs focus novel couldnt convince read fifty pages said cannot guarantee novel didnt improve wouldnt grown eventually end decided set aside main character held little appeal pace novel abysmally slow may try reread point wont soon',
 'took awhile difficult time getting one hooked sure read til end',
 'dont judge book cover book exists make zen accessible general public people practice zen essence zen felt understood words pictures one teaching zen would would appreciate beautiful moon rather finger points moon admit book manages clarify certain teachings also mentioned zen books im saying comic book provides absolute truth certain zen sayings provide insight everyone ways getting original thinking book cuts cake think merely comic book delusion remember dont judge book cover book profoundly serious highly recommended lighthearted enjoyable read',
 'disappointed read authors books usually enjoyed much dont know day problem reason really like book thats say wont read books future would recommend one especially someone never read books susan wittig albert maybe problem normal cast characters present missed cant really say looking forward next one though',
 'wife says pass one ok wife says look copycat restaurant recipes copycat restaurant cook books amazon find better selection',
 'florida courtroom thriller avid mystery historical reader youmr irvingare easy number nine thank youthats allnow rest stories',
 'cute many pictures story cute pictures well done bunch pictures many pages words hard time reading story lacking words feel silly pointing things guess son picked also prefers hear story also wants move past pages',
 'five stars always count jan karon',
 'attention keeper decent read took longer read authors books didnt keep attention',
 'hard time book enjoyed readers like two heroes book easy reading',
 'love gideon cross love romance eva gideon book good sylvia please dont take long pt  need gonna happen next',
 'dark dark story descriptive elements good premise book used lot since first publication hold attention',
 'another rollicking adventure another great oregon adventure courtesy master adventure clive cussler couldnt put guys go next',
 'murder mystery private club thirtyone men meet year celebrate joy surviving one yearthis fictional club sounds patterned real life club s university virginia organization whose members good deeds one knows members die death announced tolling bell rotundain book club started century members dont good bad deeds group one man remains alive select thirty continue legacy mysteriously men begin dying unusually fast pace private investigator matthew scudder hired catch killer onei wont reveal surprise ending thirteenth book blocks mystery series featuring matthew scudder',
 'others may like fine really wasnt sort novel others may like fine',
 'another rich overeducated writer slumming book deal author grew maid service cant seem understand former maid didnt want talk shop suspect  percent maids personal ethics woman fact anyone loved book appalling wont go repetitive details book bad peoples comments favor save cash would spent',
 'solid nongross horror mystery wish wasnt firstperson though character development others quite good mystery wasnt difficult bit slow developing totally engaging makes sense writing excellent niggles pulled story hard say without spoilers find heroines previous casual sexual encounters unsettling tune times understand author didnt want heroine virgin fine casual sex without type birth control couldnt possibly common heroine doesnt come least bit slutty call people practice pullout method parents valid birth control couldnt quite get things bothered throughout book spite excellent writing ending tied everything nicely wholly convincingly romance romantic element fairly strong though slowbuilding contain sex horror element excellently executed',
 'best laugh ive reading long time isnt often finish book say funny truly hope stephanie mcafee able publish books future home justice perfecti acknowledge grammartypo issues may frustrate anything like got far story notice andor carebottom line great first book look forward',
 'pure magic another fantastic trilogy nora roberts enchanted first page cant wait magical stories hopefully cousins odwyer',
 'great book book great escape thought bringing another character since trainees sure worked',
 'really really good enjoyed every minute reading book really like author first book ive read last funny romantic without going far didnt get bored middle ive found happens lot books sat back enjoyed journey highly recommend',
 'ah nice try ah nice try end trying hard couldnt bridge gaps second half better book sad full potential ultimately incomprehensible',
 'redundantly king love read stephen king novels home library right away found one better efforts long pedantic first  pages slow reading basis rest book end book right really gruesome ideas could brought life kings genius idea ufos always captivated humankind book uses fear unknown psychological scare maybe king right happens get big flying saucer backyard lets hope never find',
 'four stars kevin kearney books many',
 'starts slowly accelerates full twists turns read one novel alex berenson enjoyed one equally much feels current overly contrived would make great movie really enjoy main character john wells solid smart complicated',
 'honest accounting beautifully told story growing parents hoard authors expressions childhood thoughts endearing sad held honest trust would recommend book anyone faced childhood challenges',
 'time capsal sinking published months titanic disaster collection accounts tragedy although book contains much hearsay inaccuracies sensationalization reading stories lips survivors witnesses moving meaningful accounts infused feeling times melodramatic sentimentality thoroughly offensive racism sexism classism despite book intriguing perspective titanic fully capturing psychology sociology times reaction disaster also reflecting caused grade b',
 'five stars cool definitely read pieces book good true',
 'five stars classic favorite house',
 'big little lies well written unrealistic story heroine behaves unbelievably stupid thatthe whole premise story follows one idiotic act disappointing',
 'myron win back spades myron bolitar back coben brings whole gang harlans titles great bolitar series pure reading entertainment please give us',
 'excellent book  paratroopers easy company excellent book  paratroopers easy company  year old dad wwll enjoyed',
 'five stars promised',
 'ok read book ok good books series main character enough going pull entire book even though given lot good characteristics still seemed slow',
 'still writing suspense great twists turns love style attention detail baldacci extensive research books one exception',
 'pithy title great story hilarious story one womans success perks enjoys economy fails loses job thought told story well instead feeling sorry laughed way trials course could empathize happened career endeavors real lot us thankfully able relate lifes story find way come ahead startedi especially loved writing thinking head recommend reading memoir ill definitely read jen lancasters books',
 'five stars read',
 'five stars great educational tool grandson',
 'always know turn good end love sherryl woods books fun relaxed reading',
 'like series good series',
 'spellbinding quick read good read nora roberts fan way personalizing characters think end book sorry continue story chapters felt left hanging bit',
 'drawings clearer photos wonderful book beginners well experienced dancers particular one best references barre exercises center steps well covered actually first choice recommending barre exercise references reason enough include ballet library',
 'liked liked',
 'lost rogue finally meet simon stbride much th mystery expected nothing jo beverly ever truly disappointed go read reviews know like prefer pass regardless author',
 'awesome another scary well developed story well told lisa gardner',
 'good mystery read thoroughly enjoyed mystery thriller couldnt put finish like good mystery lee child seems know get attention',
 'neat book really interesting shows maybe kind phenomenol experiences',
 'five stars good book',
 'great techno thriller silver tower excitement page one tells story us space station must direct us forces soviet invasion iran consequently find target advanced russian forces plus rare female protagonist technothriller great work master genre',
 'really enjoy watching netflix read book realized netflix really enjoy watching netflix crime shes paying time came different life prisoners loved perspective',
 'love keri smith wasnt exactly expected believe fun keri smith creates magical books fun wreck book',
 'really enjoyable book favorite series one good old fashioned well written pirate romances really enjoyed love first sight theme sticking lots ups downs noticed comments love first sight unbelievable first sad youd find unbelievable fell love true love night met civil ceremony renewed vows ceremony family friends three months later surprised vowel renewal th anniversary almost  yrs ago still much love today met experience ladies realistic believable even lets forget fictional stories written creative mind entertainment purposes reality really enjoyed tears emotional touching ending thing didnt enjoy depth sex book opposed books sexual scenes graphic one disappointment gave much didnt balance area books',
 'buy ms thompson outdone novel spinning tale love military background dose family life weaves beautiful story homegirl style touches difficult impossible single mother military wanting looking love found honeymoon phase new love hard keep every emotion ever tied novel laughter tears anger elation rolled  pageturner get summary book read find details book buy best money ever spent',
 'annoying unreal know book descriptions dont tell book like tuckers character sally plain stupid unreal first whats thee thou thine annoying second lives wild west naive unrealistic view people around third shes deeply religious takes kinky sex pretty well cant get pass fact tucker killed man save doesnt deserve tucker throughout book keep thinking real youll read book know mean sorry hate much find somewhere rant bit',
 'opinion latest book coulter usually difficulty putting one books quite one enjoyed sherlock part however part borders supernatural fan type story due enjoy book much usually continue look forward books',
 'fools gold keep getting better better cant wait next book dont want end waiting fords story us next susan leaves us wanting keep coming',
 'unrealistic could someone extreme head facial damage described bookheal incredibly beautiful especially back s medical care today found unbelievable',
 'honest kushner always honest spite fact religion lot isnt insights counsel without equal others good surely none better',
 'southern fiction book started feeling kind pedestrian picked lot twists turns made interesting good read',
 'great story wasnt sure would like book wound loving yes main character bit prividged childish times redeems end fact read  nd book series likewise pleased fact author become one favorites',
 'love john sanford another author read enjoyed years hardly wait next book come',
 'love chef samuelsson book watched guy top chef masters next iron chef adore wanted know found wanted know book riveting dont like reading bios like listening driving touching story excellent man excellent book would give arm leg eat one restaurants',
 'crackerjack followup devil least man fancies stalking streets atlanta  leaving wake mutilated corpses initial carved foreheadits auspicious time goings good citizens city celebrating industrial rebirth opening international cotton exposition william tecumseh sherman destroyed city coming back honored guestintent stopping murderer tracks ring atlantas leading businessmen summon disgraced former lawman thomas canby dangling carrot opportunity restore reputation canby fought union late war though patriotism paired cyrus underwood citys first black police officer might expected tense moments two become acclimated one anotherthis wellpaced atmospheric narrative shows men fully capable evil without help occultguinn written crackerjack follow edgar finalist first period mystery ending scribe dangles promise adventures follow team',
 'loved laughed loud fannie flagg never disappointed yet usual stories hooked first page couldnt put finished',
 'five stars book series awesome love fact could follow three guys book',
 'five stars good book really like author',
 'love must read laugh cry lol cant starting another book written dorthea',
 'fatal chapter lorna barrett cant write mysteries booktown series fast enough everything want cozy mystery',
 'four stars sweet little book grandchildren delivered promised',
 'compelling reading read book record time couldnt put speak characters real rachel mess hit rock bottom life first pages wonder could ever happen improve things read story',
 'great summer read love books fromabout south one favorite humor made gruesome readable loved',
 'five stars enjoyable goods characters plot',
 'musings one mixed bag truly enjoyed heroine surprises dont normally enjoy stories heroine impersonating man really set well hero could find redeeming characteristics reason manwhore say ms warren redeem character extent end romance satisfying ending think would nice epilogue tie everything bowwhile found sexual scenes wellwritten tended overpower story would liked bit introspection hero heroine never felt like understood motivations hero particular left much desired nonetheless enjoy book sagging middle increased dislike hero perhaps one appeal others happy reading',
 'five stars fantastic read',
 'good read little slow times good story thought took long time get moral story',
 'love lisa see love lisa see care book subject matter okay standards view',
 'courting justice loved story well worth wait peyton angelo finally made move everyone expected two years angelo fighting feelings peyton finally makes move help peytons best friends sam macshelp peyton finally opportunity study true feelings angelo well spent several years walking around chip shoulder heartbreak love fact angelos mother made sure closure subject inviting family attend thier engagement party met match allow get away anything able visit old friends previous books lead fact lee madaris next line meet match prepurchased book read day even working  hour shift rated book  stars highly recommend',
 'ok think k higgins first published book correct could tell read recent books could see evolution writer flashes current writing style could also see room improvement several areas really liked protagonist male lead one well setting nice light read like said room improvement one including making millie silly joe could edited somewhat worth read',
 'thinking getting cat never cat read became backyard astronomer fall  bought modest newtonian reflector started become wonderful obsession night sky spring began thinking next telescope would would plan use looking telescope websites reading forums came across rod mollises book unlce rod known many writes wonderful manner giving information history scopes question pointing likes dislikes even though cat may find home year enjoyed book chapters get one problem even think youve done fun informative read descriptions cats available today fact book devoted using cat book  found one two scopes longer made actually still made renamed different mounts glad book resource even new cat reality get one know valuable info fingertips book',
 'seemed beginning end seemed beginning end character flawed one way another constant cold spring harbor',
 'fate feted comic book industry im longtime fan brad meltzer id read comic books came upon novels recently happy discovery since im hooked stories unravel historys mysteries brads novels like television show challenge history know adventurous mysteries youd call pageturners wont put book reach last page',
 'looking forward brit fbi real page turner get characters accounted wont able lay',
 'five stars read',
 'five stars good book',
 'another great story lee child keeps coming new adventures jack disappointed yet havent read books special order goal read',
 'four stars interesting',
 'reader leave book unmoved world long scrutinized chinas one child policy struggling understand nation could impose strict prohibition people though true country serious problems overcrowding wrapping head around harsh edict involves much thinking heart headmei fongs book answers questions rest world asking since chinas policy first put place filling blanks adding startling new journalism regarding chinese governments treatment children overall opens closed society exploring great depth perhaps book done fascinating sociological study people well opportunity whove lived express impact liveswhat makes book gripping author mei fongs telling story struggles fertility emotional toll takes personal experience puts one child practice stark perspective powerful story told powerful writeri cannot overstate importance book begun demands read end nonfiction urgent deserves read widely possible though china since ended one child policy echoes lives permanently affected long history need heard reader leave book unmoved',
 'everything says true everything says true peer reviewed nejm happy paper thin relationship big pharma publishers conflict least represents someone standing jama chicago example rolled least thats way saw iteither way america consuming unspeakable amount dangerous drugs often harm good amount spending advertising counter overwhelming negative results beginningmarcia done great job exposing trust american public violated favor profitbefore give children anything study actual benefits vs contradictions doctors pushed box well finding one well informed truthful mustim drugs used prudence knowledge book exposes clearly crossed line',
 'three stars ok',
 'must read become familiar uncle rods wealth experience entries cloudy nights forums discovered online used cat guide wow lot information equipment junkie anyone interested astronomy coupdegras updated guide cat scopes almost every topic interest amateur astronomer read every word several times read whipped old credit card bought last scope trying reach observing nirvana would saved lot money frustration ceo popular educational publishing company know much work love goes tome extremely date easy fast moving hobby',
 'adorable first book read gena showalter wasnt lords underworld book say loved found laughing book jillian marcus riot together usually like female lead character smart mouth book exception like romance lot laughs one',
 'love gideon eva one favorite book series couldnt put start finish gideon eva one favorite book couples cannot wait release next books series',
 'great guide many copied raised children ita classic great guide many copied authority',
 'love still loving everything jules bennett writes another keeper',
 'sturdy toddler fun sturdy little book contains eight pages narrative loosely follows fire truck leaves station rescues cat tree page moving part entertaining kids whove seen lifttheflaps four flaps well firefighter sliding pole truck going hill hubcaps turning trucks ladder lifting catthis board book heavy sturdy flaps may torn moving parts arent going anywhere',
 'four stars funnylove jon',
 'reader friendly concise recently studies shown much american society vitamin deficient may due constant reminder stay sun bodies become challenged juggle trying get enough vitamin main source sun foods provide vitamin enough supply body alternative using supplementsvitamin dummies gives reader basic understanding vitamin importance personally reading book gave refresher already knew extensive research diagnosed vitamin deficiency wish would privy book like information concise condensed laytalk making easy understandthe part book liked best chapter  ten myths regarding vitamin one myth addressed protect skin completely sunscreen according author statement true person uses sunscreens correctlyi recommend book concerned enough vitamin diagnosed deficiency believe able glean enough information know exactly next step either get sun use supplements maybe thumbs',
 'quirky nicest possible way hard classify one level vintage travelogue trip turkey camel another turkey book british travelers writing turkey books clash cultures two staunch members church england explore possibility starting mission school turkey also personal story love lost funny magical quirky sad one best works fiction ive read long time',
 'great read one great easy read shalvis hilarious witty humor character development definite recommendation',
 'great summer read really enjoyed book quick read thought characters great deal depth however thought storyline bit predictable ending seemed bit rushedi noticed last ms steel books read maybe authors need quit trying pump books concentrate writing one really good novel every year opinioni would recommend book anyone looking great book take vacation',
 'good book book surprised didnt know expect started reading surprised read much polish population also suffered occupation german russians surprised much irene go war much compromise order save friends wasnt able contact family recomend slightly different story',
 'page turner read reviews comparing story gone girl dont agree perspective interesting read found prediction whodunnit correct around middle book characters difficult likewhich suppose could comparison gone girli enjoyed book enough finish quickly page turner gave  stars simply reason moves pretty quickly',
 'beartown recommended highly someone respect beartown recommended highly someone respect would liked much better less exposition fewer words said good story well told lots takehome lines plenty think',
 'enjoyable easy read book written upbeat light style explain great deal aspects human behavior authors attempt tie many recent psychology books call irrational behavior call deep rationality clam approach new maybe doesnt seem much different predictably irrational writers like dan arial say except fact emphasize tie evolutionary advantages ancestors greater degreeall worthwhile book didnt come away great new insights liked douglas kenricks previous booksex murder meaning life psychologist investigates evolution cognition complexity revolutionizing view human naturemuch better gave five stars choosing would get book insteadhowever still recommend book especially people havent read lot behavioral economic type books',
 'original take wellknown classic actually love things shakespeare really excited read book completely original every wayjuliet immortal nothing like thought would wasnt bad thing although little confusing first stacey jay quickly sucked juliets world romeo hated existence saving soul mates falling love superawesome new boy ben impossiblei adored everything juliet sweet caring incredibly intelligent yet fire hard admire expected hate romeo way couldnt dislike boy obviously suffering past decisions every single dayoverall really enjoyed juliet immortal felt things simply breathtaking others little unbelievable didnt lessen enjoyment factor one bit even havent read romeo juliet still pick book buy borrow definitely book shakespearelovers bookshelf',
 'read retire book losing job life changes sugar coating phony stiff upper lips real life told real person',
 'charming stories normally dont read many   novels normally short usually im left feeling let however novel doesnt really favourite story married missouri two stories okay well found rocky mountain wedding end predictably villian coming hurt heroine husband coming rescue found ending bit rushed like epilogue married missouriloved itloved itloved atypical heroine man whos perfect match lucas lizzie couple cute enjoyed carolyn ended story alaskan groom reminded another novel bride bargain set alaska enjoyed compilation look forward novels authors',
 'interesting read two successful lives cogent interesting book two gifted ambitious brothers emigrated america made mark interesting compare styles points view different united determination find ones right way succeed enjoyed although find profound',
 'two stars one best slow get end',
 'heres johnny seemingly without rancor accusations book insiders view life johnny carson hard put',
 'good read lot romance little sex lot humor tears little angst heroine footloose barely stand come home visits hero rootbound loves small town life sparks fly highly recommend',
 'real pageturner absolutely loved book could put proof history written way interesting everyone history professors parallel stories chicago worlds fair serial killer preying young women nearby worked well together author great job switching back forth two stories without losing reader larson good writer really knows bring history life reservation work takes liberties history recreating dialogue example history professor bothers thats books bestsellers mine guess event think book keep anyone interested start finish',
 'another fastpaced sophie katz romantic suspense second book series second one read book mystery writer sophie usual clumsy fun sleuthing skills help sister sophies sister leah murder suspect time sophie enlists favorite pi help find real killerthis another quick fun read kyra davis glad see return anatoly development relationship sophie sure would return last book also got see little sophies family book quirky friends still present welli definitely recommend chicklit romantic suspense loversnote review originally posted goodreadscom',
 'madman passionate genius hear word lobotomy probably first images leap brain movie one flew cuckoos nest portrayed procedure control unruly nonconforming patients lobotomy many respects become symbol american medicine worst wellresearched detailed book takes unflinching look dr walter freeman man along partner dr watts pioneered procedure era alternative taking care seriously mentally ill warehouse enormous state psychiatric institutions psychotropics available time treatment really available psychoanalysis therapy works best welleducated articulate nonpsychotic patients freemans work zealous sure however made link organic disease behavioral signs symptoms well ahead others eyes freemans journals voluminous family members colleagues author reveals passion men intent freeing patients bonds serious mental illness makes reader question preconceived notions well wonder new era psychosurgery',
 'five stars excellent writing suspense',
 'excellent garden guide love book includes lots ideas sketches good thoughts summarizes many sunset articles organized way using help convert front lawn beautiful edible garden along couple books including beautiful edible garden written leslie bennett stefani bittner',
 'beautifully written thoughtful important book beautifully written thoughtful humaneevery one read itthank dr gawande',
 'love love love easy read series teenager far author went writing way far alienates adult readers best lot stories series dont fall love immediately left high dry get time characters',
 'interesting story well written story interesting captivating writing wasnt way written like student paper describing event riveting tale told experienced story teller',
 'good good read would recommend one even hadnt read first  one catches upcant wait th one',
 'awesome book crais really able help us understand k unit dogs perspectiveincredible read loved every word highly recommended',
 'outstanding book would moving wonderful ever pleasure reading truly remarkable author light reading tragic world lucky indeed person author use medical skills describe feelings us would wish however must express love animals ways absolute must book',
 'must read classic short easy read want read',
 'make best sellers list dont know book made best sellers list guess people like garbage takes forever get point boring book author seems assume know characters personally reading book timeline jumps place like reading thousand different descriptions train go ahead buy book',
 'read start finish one sitting storyline easy follow enjoyable author taken lightly gal edge suregreat reading bought left wanting',
 'fancy owls coloring book drawings well done showing lot artistic flare fun relaxing coloring designs',
 'happened vonnegut could man turned galapagos slaughterhousefive cats cradle turn yes give favorites credit parts hocus pocus excellent parts sections creative funny drags glue novel elmers nothing interesting abnormal material holds novel together get novel worse gets droaning reader hits good part good part good typical vonnegut rules good part kilgore wrote stuff read time dont prevent reading something else first',
 'nora good read story strong woman suspenseful tale love story wish men like character',
 'great book great service great book delivery came good amount time one scratch',
 'excellent havent read sandra brown novel several years came across one weeks ago well written well formed characters end surprise still enjoyed much',
 'another great one cal niko like dynamic duo hell ninja monster could ask',
 'five stars great book',
 'five stars enjoy highlander books one crossing timespace',
 'five stars aaaaaaa',
 'birdie dan jenkins dean american golf writers count hes covered  major championships  years various publications beginning  u open selected  best perusal lifts book usual collections columns sportswriters superb organization organized chronologically easy reader follow march golf history forward fast fun read columns short pages really fly could also seen negative however medium column rarely offers one space give indepth holeby hole account jenkins usually limited much general impression tournament left luckily us thanks considerable skills feels like enough casesjenkins majors absolutely essential reading anyone loves game especially fans whose golf consciousness began tiger era',
 'five stars great',
 'reread old favorite read first came really didnt remember much enjoyed story definitely worth repurchasing',
 'another great death book love death books another solid addition series havent read series recommend starting beginning first book naked death',
 'fabulous first step world adult coloring books back s working childrens play therapist would often color clients client cancelled appointment used kids coloring books office found enjoyed much fun discover adults catching coloring demanding mature themed coloring books first purchased set  crayola crayons may try coloring devices future date linking inner child loved crayola crayons part joy sixty years old still remember wonder receiving first box  crayola crayons love cats many fond memories cats shared life colored nine offerings first  hours doubt ill keep pace wont long im ready another coloring booki purchased one another dogs sister christmas along  pack crayolas hope likes',
 'tubby quirky tubby works hard help man accused murder meanwhilehe seems danger doesnt realize almost late',
 'recipe book healthy eating birthday gift believe friend likes shes shared recipes trying together',
 'worth read getting lucky quite good head heels first book trilogy still quite good intertwining plot lines kidnapping zachs sister andher fianc colombian revenge zach well written kept attention main characters zach lilly likable enough although zach quick jump conclusions lilly renting room sister zach assumed con artist swindle sister inheritance lilly strong female lead knew could correct zachs misassumption wantedto self confidence let think wanted extent lilly went swear wear little thin awhile probably real complaint book two finally got together susan andersens books relationship sizzled overall good book well worth read',
 'must read fantastic book facts competent understanding cant keep',
 'enjoyed lot fun someone interested victorian period reasonably well researched',
 'three stars easy read',
 'four stars son says little slow parts second book',
 'great suspense really enjoyed reading really kept interest book hard put continue buy series',
 'typos spelling errors distracting great information spite liked information however kindle copy lots spelling errors little editing would go long way better experience',
 'charming funny mystery greeting card designer serial dater wollie shelley returns harley jane kozaks delightful third novel dead ex wollies exboyfriend terminallyill soap opera producer murdered best friend also dated prime suspect wollie must jump headfirst sharkinfested waters hollywood find really happened clear friends name dead ex charming book welldone mystery plenty fun characters',
 'princess plot truth jenna wasnt really interested trying role movie besides mother unbelievably overprotective would never approve still goes along friend bea surprise lands part suddenly shes plane kingdom princess bears rather surprising resemblance reassured constantly mother knows shes kept completely deliberately touch jenna begins realize something going good needs escape gilded cage find truth truth involves rebels mothers hidden past relatives didnt know fate kingdom jenna quite actress pull reallife role isnt fantasy set modern times appeal fans cabots princess series theres danger suspense mystery oh bright pink cover wont hurt',
 'winner marvelous story telling seamless transition original author new author consistent spell binding characters highly recommended',
 'bear snores wonderful interesting book read loud rhyming beautifully done clever word choices good story teller book sort helps flip pages youll see letters capitalized gives cue emphasize part someone great talent story telling feel style constrained cues ignore means fun part reading picture book loud express words style think cues probably authors way setting book differently others could also authors reading expression literally translated onto paperwhatever book fun read',
 'stop get book jamie goes way beyond shop closet idea great examples help get organized recoup cash helping planet reuse repurpose recycling cluttereveryone needs read book keep close references know keeping mine next computerinspirational',
 'gathering prey excellent like letty hope lucas still needs gun mr sandford needs continue prey series please',
 'perfect tool teach child emotions bought book two year old starter show emotions specially anger book really helped teach emotions remimnded emotions natural respect together book discovered ways express healthy way',
 'five stars great book teachers read understand kids',
 'students enjoy working various chapters used book creative writing class teach students enjoy working various chapters produced solid work result',
 'smart well paced thrill ride vanessa michael munore vulnerable catch taylor stevens riveting tale piracy somali coast africa heroine vanessa michael munroe needs time harrowing adventures took place last book returns africa michael persona takes thankless job problem solver djibouti circumstances soon change finds unwanted outsider helping guard cargo ship coast somalia pirates attack ship story roars therethe catch told entirely michaels perspective brilliantly written pageturner shines insiders light modern day somalia djibouti story line complex michaels motivations nuanced added greatly depth storylinemichael completely extremely limited resources book seeing gathers sifts information dealing ever difficult set circumstances fascinating readertaylor stevens consistently delivers smart fastpaced adventures take readers deep exotic locations able learn different cultures carried along highstakes thrill ride catch favorite book  highly recommend enjoy smart wellwritten thrillers',
 'five stars awesome book great illustrations must',
 'good discussion book clubs book club good discussion would recommend book clubs found negative thoughts constantly throughout book devious characters much really enjoyable read gave four stars good book club book',
 'soo much great info must book  months signing youve read  mortgage books already might need bookauthor goes great detail home buying process refinance process unique situations author spends great deal time explaining terms bogus charges like example section prepay penalties helpful us since know demand removed signing instead finding  years going get charged k due ignorance time purchasei purchased book finding uncommon deal top new york lawyer explains buy home lowest possible price together two books bit online googlingupdating feel prepared first home purchaseit amazing saving    book purchase',
 'authentic vietnam dying place novel carries authenticity stuns reader reality vietnam like well done sergeant',
 'must sisters met author book fair st pete fl last fall enjoyed verbal stories say loved written work characters real',
 'five stars love',
 'four stars helpful sucks couldnt rent instead buying online',
 'really neat book oh man e noted book complex plot swhat made great read good u enjoy books sure didpenny garrison',
 'excellent love book biggest challenge deciding colors use many possibilities beautiful designs good quality paper well placed page tear picture still centered pagei use colored pencils problem bleeding thru',
 ' books good first timer know hobby  books good first timer areas wasnt disappointed trip would find stones without',
 'interesting read little hard follow found storyline incredibly interesting author compiles various interviews different sources well personal interviews reading sometimes difficult follow talking since called person different names also chapters ended leading climax thought youd continue next chapter author would switch stories come back  chapters lateroverall enjoy book recommend read anyone interested music business',
 'fantasy horror fantasy basically overactive imagination two boys middleaged man carnival came town entertaining sure',
 'con job remember enthusiasm controversy book caused came  years ago took long time figure problem fraudulent premiseyou cant test effectiveness antiaging lifeextension regimen faster rate humans happen live blows completely transhumanist singularitarian nonsense conquering death becoming immortal arbitrary date century say  plenty people alive survive year regardless ones currently age  soin durk pearson sandy shaws case published book around time turned  yearold americans taken care still look good enjoy good health way pearson shaw enough baseline age substantiate claims discovered practical scientific approach promised cover anything following advice could antiwork damaging health shortening life insteadand youve seen recent photos online youve met person youll notice look ish like everyone else age experiment failed obviously fortunes made suckers bought books products years probably funded comfortable retirements',
 'powerful storyteller finished bob dugonis book quickly felt fantastic story grabbed interest right start never let go characters drawn well feel part action like another world would hesitate recommend book anyone well worth time read good fiction book excellent go bob nice job booki pleasure attend one bobs presentations writing chicago long ago really outstanding teacher really enjoyed writing course get bob dugonis books worth time read',
 'written like chidrens book ive huge apollo fan many years watched moon landings young kid think ive read every book space program s s although respect michael collins couldnt get book written like childrens book went tangents even related apollo  mission want good book get deke moonwalker last man moon',
 'slow finish love historical fiction expecting gobble book faster sonic burger favorite food days however could get one stopped  pages picked something else insteadi tried several different days read season never felt invested story dont know openmouthed flycatching expression girl teal dress cover despise wideopen trap like didnt finish one despite read several positive reviews',
 'five stars lots action good reading',
 'five stars harlan coben doesnt disappoint great story kept dark end',
 'good obviously quite used good obviously quite used',
 'wish different ending wouldnt turned better sparks talented drawing couldnt put book wish different ending wouldnt turned better sparks talented drawing story',
 'another masterpiece reading howatch lately enjoying old friends finding new ones lady really really write particularly enjoy writing different sections book perspective different key characters truly helps define illuminate character wheel fortune may favorite never felt tedium novels notably church england series produced satisfying book reader starved literature',
 'five stars great book fast read',
 'awesome read love pamela clare awesome job crying laughing couldnt put holly nick great couple',
 'fun amazing life different completely honest othersfun laugh loud read',
 'sweet love golden books trying collect ones kid never one step daughter learning read really enjoys',
 'good story enjoyed reading book found hard put kept trying figure done avail pd james really knows keep ones interest read several books havent read one didnt enjoy reading everyone books completely different others comes good stories',
 'insightful wise charles krauthammer one intelligent lucid commentators writers time always insightful fair trust wisdom looking forward weeks book come downloaded midnight last night soon available started reading everything hoped voice reason politics society todaydorothy hensrule',
 'oh wouldve given  stars think author may gone little overboard detailed description things fair amount internal dialogue really annoying case allowed reader see hh relationship fell love loukis character knew thoughts intentions liked plot characters love story h wasnt arrogant damaged give h much needed tlc',
 'love movie bought husband love movie wanted see different book thanks mary ellen',
 'terrific book characters sisters terrific book characters sisters family could never boring',
 'five stars great book',
 'amazing book read visited malta made trip unforgettable experience returned trip malta planned months ago im fan military history reading book proved astonishing experience combined visit places great historical events  took place mr bradford amazing talent bring history life storytelling ability truly fantastic walking along valetta seeing st angelo senglea mount sciberras st elmo marsamusceto mdina many places felt lucky indeed chosen wonderful book order learn details great siege book achieves aim perfectly tedious overtly long many interesting details describes famous personalities covers weapons tactics fortifications strategies knights ottomans th century many little known episodes great siege told way fit perfectly big picture maps although need order fully understand events unfolded particular way certain reading book first make trip malta fascinating experience every history fan',
 'five stars good',
 'dan simmons great storyteller eerie exploration return dan simmons great storyteller eerie exploration return youth exception makes past present seamlessly come together gives us thrills along way nice read little bit thought',
 'excellent book true page turner excellent book true page turner captured attention couldnt put loved much order husbands secret',
 'harrison proves three good solid engaging novellas written humor feeling sexual descriptions bit optimistic still highly recommend',
 ' works lifetime making poetry personal thing much think prose say reviewer like selection poems say chose order book poems first reading poem two three poet spoke new england author possibly found readily readable relatable definitely donald hall new poet would recommend volume high school college curriculum anyone enjoys poetry mature poet experienced full life maybe goth punk beat poet would also find work  inspiring thoughtful',
 'shadow spell loved relationship six spell binding read looking forward next book find happens',
 'great story mark twain great american writer story inspired many jimmy buffet songs worth reading one fact',
 'figured stranger getting info figured stranger getting info never figured read wifes whereabouts',
 'good read insights japanese good readthe insights japanese side',
 'fine class compatible mac book fine class aware cd comes compatible mac computer well online help tutorials books offers',
 'awsome look good book good better story clean dont need bedroom stuff good love storyits hard find great',
 'funniest book ever read bar none book hilarious never laughed loud reading book like milkrun trials tribulations singlesdating scene exposed funny bookthe book quick read need good laugh pick',
 'good combination scifi legal rule evidence third jag space aka paul sinclair series good solid read characters well developed setting reasonable nearfuture space adventure good deal legal detail may taste readers fairplay mystery clues solution unfortunately correctly guessed solution half way book readers less paranoid suspicious may see coming nevertheless found enjoyable read mr hemry description navy shipboard life rings true',
 'good much usage unnecessary language review appears book  applies good much usage unnecessary language characters need development suspenseful say would take away plot twists',
 'good hard take break reading held interest way another later life',
 'funny books funny books large number characters remember keep knew would eventually come together',
 'simply wonderful definite keeper read whole thing one evening couldnt put one heart touching story',
 'disappointed grishams camino island im sorry dissent majority readers disappointed camino island initial narrative crime pace story slows crawl mostly woman characters thoughts chatter among friends little drama action suspense frankly opinion especially compared grisham novels book simply drags boring never thought would say grisham book wonder one intentionally written appeal women readers third way middle conversation bookstore owner male women writers grisham adds fact seventy percent novels purchased women wonder thats way telling readers one seem different others much novel woman memories grandmother longing love circle female friends suspect male grisham readers notice change style lose interest know keep waiting something anything happen',
 'hidden indeed really enjoyed book especially suspensefilled plot two women mourn death man one wife coworker lover alternating points view deceased jeff wife claire tish slowly reveal clues jeff may kept many things hidden lifeexcellent story real pageturner highly recommended',
 'christmas cookbook family traditions sister saw copy cookbook last christmas plan give copy bought thanksgiving help herchristmas menus baking gifts used dinner suggestions made great cookies grandchildren receipes book',
 'exciting mystery review concerns original  edition well edited  edition plot similar original bess cousin dick asks nancys help valuable old oriental vase display stolen pottery shop also tells nancy pit china clay supposed located somewhere around river heights near leaning chimney clay used making fine pottery could make dick wealthy could find finally owner stolen vase mr soong seeks nancys help finding missing friend mans daughter came america years previous disappeared arrived mr soongs certainly one interesting books nancy drew series book quickly grabbed attention managed hold final page really say happened  books series quite bit action book writing good least original edition unfortunately revised editions writing dumbed enjoyed book beginning end think nancy drew fans would place one best series lists',
 'good traduction cloud unkonwing classical one three best caltholic church contemplation another traduction modern english available bestof knowledge one good',
 'real vampires curves gerry bartlett really enjoyed book fun really like idea vamp trying make way world isnt rich shes pack rat trying clean isnt brooding fact doesnt want vampire dog valdez fabulousgerry created bunch really great characters book even dont like vampires like fun sexy books one isnt lot sex sexy sex',
 'one favorites nora read robertsrobbs books one favorites wish theyd make movie find characters believable likable even crooks part despite criticisms reviewers cop someone says shrink think cops learn read human behavior survival skill roberts knows cops take word someone worked cops odd years plot dialogue tour de force creating tension making reader laugh love characters roberts tension never lets outstanding great read',
 'great characters great characters want spend time im little saddened winter aaida notbe focus next installment series nevertheless series promises interesting one bennett gracefully captures mood twenties gives special paranormal twista joy read unabashedly sexy well',
 'astonishing viewpoint reaps hidden emotions readers soul choosing book read initially simple curiosity resulted visual tour life death germany  prose evokes full range emotions slice history dissected examined intensely interesting narrative source recommended reader brave enough share journey soul death side',
 'first best reread hotel du lac years see booker made aware life philosophy proposed clearly brilliant writing depiction characters characters jump page edith mr neville eventually got used anita brookners predictable plots first book burst upon reading public rightly enchanted',
 'bad fiction weak far first novels author goes marketeering first succesful book nothing new nothing surprising bits pieces subject surely make lot money repetitive course bad fiction bad true novelists',
 'three college basketball icons profiled great genre sports fan book great interest',
 'ghosts creepy book great ghost book part washington dc ghosts live',
 'kate remembered book make laugh stubborn ways weep read decline health written friend scott berg peek katharine hepburns life totally irresistable miss hepburns personality enormous books making african queen read talking right similar charm shines also loved book simply absorbed arrived door',
 'great addition series great storyline continues latest book love interesting background story lots previous characters fill book course great steamy romance',
 'books obviously amazing read need books obviously amazing read need buy books immediately take vacation work family friends read books asap',
 'another emotional read ms macomber denim diamonds debbie macombersilhouette special edition  december its nine years since letty left wyoming chase pursue dream singer la teen shed watched mother slave ranch put aside talent artist shed vowed shed never settle shed follow dreams first shed begged chase come tied land knew shed back well nine years along time wait chase became bitter fact letty also daughter eats letty returns wyoming hoping rebuild stable life daughter nine years apart hasnt changed love lettys always felt chasethis another emotional read ms macomber lettys come back wyoming lot regrets mother comes see mother never settled used talents ways chase cant seem stay away letty invisible thread heart still much intact pleasure read storyread also sequel wyoming kid har  july ',
 'a great story line left many loose ends next book tie mexican drug timely',
 'great book fun book good ages',
 'two stars  page story expanded form book',
 'short nice history peter great short  locations nice summary life peter great free volume would recommend anyone wants overview peters lifethe author good job sketching peters life showing good parts well bad partspeter definitely guy would done darndest avoid got idea head',
 'nice baby beluga helpful singing along illustrations nice illustrations baby beluga',
 'read iraq think read iraq think bush cheney rumsfield tried war crimes',
 'muddled story confusing many characters great many venues easy read characters almost like science fiction usual easy read author',
 'great book helpful windows xp person computer life switching  problem  foe dummies fixed',
 'considerable room improvement latest hope last book therh least thing solved notat clear audience author aimingfor afraid even spell suchbasics euler product clearly wholeenterprise doomed first third book maxim picture formula worth athousand words inverted essentiallya total loss redeeming features sectionson connection quantum chaos random matricesstill considering resources available authorone expected much',
 'hot romance long time since ive picked harlequin romance novel jane porters dark sicilian secret reminded loved much redblooded woman wouldnt want tall dark handsome man sweep feet thats exactly happens jillian smith raised daughter mob king pen lives childhood witness protection graduating college escapes europe meets vittorio dseverano flees upon learning may head notorious dseverano family soon finds run cannot hide ruthless man especially hiding infant son book awesome love scenes wont staying away long time enough suspense keep intrigue harlequin fans definitely want pick one',
 'love harlan coben six years great kept guessing love books ive read great stories hope theres come',
 'three stars interesting info white house functions staff also interesting info first ladies',
 'loving marshalls series let start saying really liked book loving marshalls seriesnow wonder author went many ways times heroine hero shot lost count poor couple good thing got pregnant book started otherwise wouldnt much chance shootouts author also great job leading us readers place one string pulled unraveled another another didnt know bad badder yes said badder worse others till endthis great series cannot wait rest foster brothers books full action suspense great chemistry hero heroine',
 'thought provoking hard sf imaginative hard sf fan',
 'five stars great book',
 'five stars loved',
 'miss mr monk wow though watching series tv oh miss pleasedi read craziness loved',
 'read books series enjoyed ive read books series enjoyed easy reading twists turns somewhat predictable enjoyable',
 'wonderful workbook practice sylvan third grade reading comprehension workbook wonderful tool use child trouble reading reading comprehension allows think come concepts fun easy way reinforce without making seem like schoolwork child actually enjoyed work book seemed disappointed book finished child needs little extra help workbook activities fun colorfulwith summer approaching would good way reinforce already learned keep track forgetting struggling definitely recommend workbook hope purchase',
 'wonderful books oh good read books taken cruises read vacationing gets beginning prices buy slighted used book incredible going buy new ones vacations also home heck yeah best thing since sliced bread',
 'five stars book arrived indicated condition indicated',
 'four stars loved books',
 'five stars great',
 'special little gift big price review dead starts thank charlaine harris special little gift felt like many bios ones created wanted share readers chuckling loud made unexpectedly teary eyed made think ms harris isnt sure shes quite done writing stories characters quinn barry bellboyi read whole book hour enjoyed little gem  stars instead  im greedy reader wanted details ill add extra half star certain little surprise character appearance dead wasnt expecting thrilled  starshowever penguinberkleyace showed greed one paid  hardcover preordered stretch imagination  book even  one  book shame penguinberkleyace choking life cash cow  stars greedso  stars ms harris gift readers  stars publishers leaves overall rating  stars',
 'interesting funny read get  year old hates read read one book month take ar quiz school rd melonhead book actually likes read interesting funny read get tired authors use word said way overused son actually fight reading book  star book',
 'conjureman dies nice book man cried fair  books completely different arent really useful comparison read class',
 'christina lee author watch ohi really loved ella gabby quinns daniels story loved attraction physical emotional got even two different people time meetchristina lee author watch',
 'five stars another great addition ongoing series',
 'could put book danielle steel newest book book woman betrayed people close expect allwhich makes hard put book find happened danielle steel amazing writer',
 'glad kept old falling apart counter title misleading glad kept old falling apart counter truly dont know products listed items need count book wasted good money bitter pill swallow since im fixed income',
 'inferno usual dan brown uses extensive knowledge art vatican history create entertaining suspense novelim looking forward next book',
 'dramatic mysterious book listed mystery starts creepily enough abandoned psychiatric facility night group friends gets trapped one disappears novel takes turn dramatic territory disappearancemurder even relevant focuses characters relationships giving special attention judith whose chapters written first person versus rest third person fans dramatic novels probably enjoy fan whodunits things creepy interested backstoryi chose read book opinions review completely unbiased thank netgalleycrown publishing',
 'well written bit dull known wrote book would guessed early anne tyler novel similiarities style content striking main character interesting quite well developed many rather banal conversations involving minor characters welty fine eye detail good sense time place plot strong enough sustain even short novel im sure missed half point book could bothered read find exactly id missed',
 'three stars didnt relate characters',
 'five stars enjoyed book much started reading stand guy right finishing',
 'four stars eyeopening',
 'rivers end sorry nora roberts missed mark one found boring read end find turned loves loves loves loves ending exciting redeemed story usually like stories much mentioned earlier disappointedwin lose',
 'became much better continued read story line china dolls slow beginning became much better continued read really enjoyed learning time story took place especially sad time japanese taken homes relocated camps ww',
 'five stars loved',
 'actually great read theres something arent needed book com  actually great read theres something arent taking class want improve interpersonal communication skills great stuff',
 'wildly imaginative alastair reynolds definitely one imaginative writers contemporary science fiction terminal world blends postapocalyptic steampunk setting mysterious version modern physics posthumanism admittedly noir element gets little old visible first pages anyway somewhat hard summarize story without giving away much solution starts main character quillons flight spearpoint turns quest across different zones certain levels technology start fail zones begin shift dramatically across world threatens future mankind different stages still consider house suns pushing ice pinnacle reynolds work terminal world among par revelation space actually builds similar level physics fans fire upon deep likely enjoy terminal world didnt like former may want give reynolds try much easier approach everyone else start terminal world',
 'great read one time favorites really made wonder could still medallions love biology zoology absolute must read',
 'disappointing slowest part slow cooking book prep time slow cooking steps kind defeats purpose using slow cooker ever decide sell stuff online stack stuff',
 'fan dan browns work really enjoyed reading inferno read davinci code digital fortress enjoyed well big fan dan brown work',
 'certainly pulls punches book amazing ya novel ive read thus far tackles problems kids face today ways speed development let know theyre alone language aware buy recommend kid author outstanding job creating environment certainly two protagonists emphasis agonists fact want see katie mcgarrys id suspect age real life prescreening ya novels granddaughter since review vine products anyway decided cant trust anyone days comes media placed young ones hands say without hesitation message sent story going straight kayla read despite language think important enough done better talk could highly recommend',
 'harry potter kindle versions great buy paper versions harry potter read recently bought kindle versions rereading rd book kindle versions fantastic nice seven books small device plus add love movies think book versions far better lots happenings included movies movies changed events suit time flow kindle versions much worth money',
 'make living life career coach viable career choice book isnt howtocoach book marketing book howtorunabusiness book directed life career executive coaches many many people call coaches thousands newcomers enter field year small percentage coaches make six figure income vast majority coaches earn less one third distinguishes financially successful coaches others much marketing strategy ability define ones position market focus defined niche operate coaching business business anyone seriously considering coaching frankly type consulting career would welladvised consider business aspects coachingconsulting practice success requires coaching skills enterpreneurial mindset business savvyin book authors talk positioning differentiation market enterpreneurship authors also describe different types coaching executive small business career life skills relationships creativity outline marketing strategies tailored type finally authors provide plenty helpful information wouldbe coaches several appendicesoverall eyeopening book anyone thinking embarking consulting coaching practice great complement many books focus howtos coaching',
 'lucas davenport neutered sandford finally neutered lucas davenport dialogue describing gq items wear today dialogue describing expensive cars take today much less confrontation detection huge disappointment latest line watered additions lucas davenport saga',
 'loved thoroughly enjoyed jean shepherds stories',
 'captivating cohesiveintensiveand well written kept entranced start finish binary anchored researched context tale creatively woven',
 'five stars love anything sylvia browne writes',
 'things really work though guide companion western forest edition print decade stumbled last year concisely provides missing links field guides plants fungi insects spiders reptiles amphibians birds mammals tracks fossils get drift hiker birdwatcher feeder observer photographer amateur naturalist first step usually simple identification species summer warblers course first step actually seeing bird question way traditional field guides provide portable id info ecology version helps understand change see hike beechmaple forest oakhickory stand subtle differences northern riverine forest segues northern swamp means comprehensive remember fits pocket book like science ecology composed seemingly endless delightful digressions galls come dragonflies mate ever bothered learn frog calls vegetation old field tell history volume inference western companion excellent fascinating addition field guide collection',
 'tinkers disappointing usually love first novels fresh insight something say sometimes bit rough originality worth tinkers disappointing hard follow found writing soso story father son intertwined theory forms good foundation book tinkers deliver symbolize inside workings clocks used love antique clocks  used effectively jacket states part tinkers elegiac meditation love loss fierce beauty nature opinion book didnt rise level didnt feel deep empathy either father son book convey beauty nature think paul harding talent good ideas novels believe needs review conveys ideas reader would recommend book',
 'wonderful summer read typical grisham booki finished long plane ride well thought characters twist ending ended time started drag',
 'love want illustrations reason reading next seriesan enjoyable followup first book series main plot centres around group heroes continue journey quest find second piece spark everyone book one appears volume good evil characters inclusion characters joining quest possibly temporarily book enjoyed new characters continue love regulars series plot fun exciting full dangers characters developing especially leader tom warrior priest veni yan randolph enjoy character driven books plotdriven characters personalities motives also quite important plotmy complaint much leadin book refresh memory s book one take chapters get story start remembering also im pretty sure said review book one nearly enough illustrations jeff smith one per chapter part illustrations would gladly welcomed jeff smith',
 'good publisher hard find another fun read accidentally stumbled upon looking something away hot afternoon book made laugh loud characters shenanigans',
 'four stars heed warning',
 'didnt love like stories maybe dont care sin redemption much oconnor doeshonestly era like hard know good guys bad guys',
 'great writer great book really like mr smiths books different anything else find case taken couple dozen stories southern part africa probably rewritten bit published book different stories heritage stand different main character mma precious ramotswe  ladies detective agency would grown talesin fact mma ramotswe written letter published book part reads told fathers aunt old latehow love book like',
 'drawing fun son loves draw book added collection learning draw books user friendly helps teach  steps early drawers',
 'great trilogy great read usual nora roberts didnt disappoint bought first two wasnt able get third book soon finished got book read day would like see series continue f enjoy good love story recommend series',
 'great scifi buffs type book hard put keep waiting next event',
 ' mechanical movements mechanisms devices dover science books book real neat shows many things take granted shows work one kind look past present future one book see get jog mind',
 'five stars great story line loved',
 'shame ms palmer writing automatic pilot long time book exhibit type virtuallymisogynous hero irritating heroine totally predictable story evidently ms palmer huge following one one time maybe try showing little respect write story doesnt follow usual formula doesnt hero heroine one barely tolerate story actually surprise two currently disgrace romancebook storytelling',
 'best historical trilogy ever absolutely loved trilogy books never wanted end full interesting characters enchanting locations marietta multi dimensional character cant help love root thank goodness kindle able read one right another one word caution wont able put finish chores etc first',
 'three stars interesting plot weak conclusion',
 'wyoming best wow boxs writing style definitely matured years read early books  years ago wasnt impressed recently friend suggested read one newer ones due topic loved started reading others start book cant put definitely see beauty wyoming boxs writing takes places ive ive never seenlooking forward late july release take reading day love wyoming youll enjoy books',
 'five stars great item',
 'reasonably good first novel  years ago insulting general townspeople valedictorian speech skye denison left scumble river planning never coming backunfortunately large unnamed disgrace occurs leaving skye fiance job moves back scumble river take job school psychologist arrives talked judge chokeberry jelly contest annual festivaland soon discovers body former scumble river resident happens celebritythe problem evidence points skyes brother police dont seem want search possible suspects despite much opposition skye takes matters handsand risks life processi enjoyed novel every scene skye actual job beyond investigating true pleasure unfortunately could save book little character depth character beyond skye hunches furthering investigation come nowhere opinions changed within one sentence instance take love interest storysimon towns coroner funeral home owner evil nasty first half book downright saintly second half doesnt workhowever good example debut novel heres hoping rest good better',
 'quick read exceptions good could get past fact  knew eachother less  hours confessing one another lovesilly cheesy romance doesnt mean unrealistic',
 'engaging predictable keeps engaged turning page little predictable though enjoy christophers templar series finish',
 'buy well titled review buy actually get three books series loooooooved hope writes lot jane yellowrock series ive read books liked lot well work library know good books im proud owner stephen kings book well anne rices draws lot comparison patricia briggs happen think bit better doesnt go overboard descriptive details like briggs stumbled upon first book skinwalker read everything done since looking new author faith hunter really good one check',
 'harry potter deathly hallows great book best way enjoy book read last six books first full key plots wont mention would spoilers',
 'five stars gift',
 'ok interesting hard tell ok interesting hard tell anything explanations correct',
 'five stars fitting ending wonderful series pacat simply marvellous cant wait read',
 'love ghost hunter mysteries loved books start although im glad handsome heath course im biased love native americans thought book awesome agree viewers book gilley love get nerves bit either way recommend book newest one came christmas eve havent gotten yet seeing stacks books cant tell one yet like series though one good',
 'tea drinkers enjoy like mysteries like tea need say',
 'great book great story characters developed strong feelings want win scare true implications story real world',
 'peggy mcwilliams excellent book best danielle steela must read everyone interesting turn events unlike previous books',
 'one best one favorites demilles standalone novels love work one read still anxious',
 'yet another wonderful play play jaci burton yet another wonderful play play jaci burton wonderful read order much better since get background appearing characters love love love',
 'fun read good book enjoyed reading love book part series find lives go',
 'benfords vision doesnt translate idea brilliant billion years future many successive species homo evolved made mark galaxy passed extinction humanity represented effete geneticly engineered species cloned earlier species called originals possibly homo sapto restore ecology earth space huge organic ships ply space lanes earth solar system plunge toward orbit galactic core orbital mechanics achieved much earlier species comes multidimensional bad guy wants destroy originalswhy knowsthis book begins struggle benfords ideas moving rapidlylike ideas get sleep ideas urgently write forget imagine brain overdrive tries put stuff paper however vision doesnt translate well leaves reader lost multidimensional confusion vision vast vast us could imagine series compact books within universei give book  stars incredible concepts subtract three stars confusing plot jumbled narrative',
 'detailed book begins  ticketron using cdc computer size s copier agents paid month rental earned  centsticket trs also netted amount venue major department stores plays customers shortly thereafter csc opened computicket west cost locations ralphs grocery promoted systems means limit scaling increased advanced sales limits salestransaction ending fraudulent reporting sales comps sports venues added soon afterwards computicket however computicket folded april  lost  million ticketmaster became front runnerreaders difficult time maintaining interest extremely difficult see forest trees far much detail',
 'fascinating book unconventional family really loved extremely wellwritten memoir kept late night wanting stop reading characters like none ive ever met yet totally believable authors family unlikable even despicable yet gained insight behaved author totally sympathetic character blown away much able accomplish despite terrible childhood adolescence wonder siblings didnt die parents negligence highly recommend educated memoir',
 'love spell witch vampire story wonderful read liked romance vampire witch would recommend read',
 'first hand account fom victims holocaust powerful terrifying enlightening accounting person experiences adult survivors holocaust written personally matter factually ithumanizes horrific time history needs told last generation actual victims passes away thank revisited nightmares experiences forget',
 'shipped quickly book gift arrived excellent condition',
 'lighter touch dark loved th book series great story lots adventure lots romantic sexy times cant put books read days time entertining spellbinding great read great writer',
 'pick good weekend read reading series far know youre getting kate still oddly weak badass time nasties creatively menacing pick good weekend read',
 'three stars expected however find interesting',
 'remarkable love story amazing love pictures incredible story formed pictures love every book read far lois lowry cannot put put book read find ends',
 'hes back mean lucas davenport fourth prey book glad many go hes back means dr mike bekker drugged psycho killing way new york city eyeballs everywhere pace little slower prey books lucas davenport great detective always worth read would happier back minneapolis hey john sandford calls shots one double female detectives deal lily rothenberg ex davenport barbara fell prey series rocks highly recommended',
 'excellent text somm level  somm working next levels easy read excellent study guide advanced read sales service wine professional',
 'easy fixes plot complex enough interesting characters complex issues easy fixes plot complex enough kept guessing like cant figure ending halfway point',
 'five stars love worth reading contains beautiful life lessons well universal questions',
 'sweet tender meet michael nikki wont regret series sexy passion sweet tender along action intrigue something everyone',
 'enjoyable really enjoyed book loved strong characters dont know anything navy seal training cant comment realistic women seal training however enjoy realism fact hero didnt want heroine seal pushed safe felt pressure compromise dream could together feel thats real life women want kids going expected compromise personally feel worth parent best important occupation worldthe book flowed well easy readmy main negative ending felt bit abrupt wouldve loved epilogue spoilerto show really make work wasnt lip service emotions moment end',
 'great book really interesting story amazing educated smart people corrupt guise caring sad professional book also brought anger ignorant people yet empathy time well written got little confusing keeping characters straight really didnt affect core story',
 'edgar cayce collection great complement enhancing qi life force edgar cayce collection great complement enhancing qi life force miracle healing book similar saam medical meditation practice meditational practices like yoga qi gong try incorporating saam meditation technique procedure meditation based upon  year old korean acupuncture technique instead focusing chakras third eye saam meditation technique brings ones qi attention acupuncture points hands feet specifically target twelve primary organs like heart liver lungs stomach kidneys etc meditation upon hands feet strongly stimulates brain intensifies sensation healing effects qi across whole body saam medical meditation pictures four point acupuncture combinations organs point meditation easy locate learn one fascinating adventuresome meditation techniques ever experience may deepest meditation go try',
 'definitely read government spending perform private sector creating jobs long term economic gains generations important book reframing conversation american economy work everyone',
 'lovin fools gold loved great read love romance',
 'nice escape niece enjoyed book muchi read book first fell love rooms magic thought maybe niece would like told story well knew real miniature rooms museum chicago last family chicago gone seen rooms niece even purchased lovely book describes rooms gives information createdthis book lot fun read say wasnt exciting books shes reading worth reading anywaywell loved story im waiting niece return copy read',
 'scots delight also scots blood running veins enjoy series historical fiction good deal steamy sex books could done withoutbut guess makes romance novel liked history mystery',
 'rhagehollywood good read loved rhage almost beautiful look fell madly love average looking woman took breath away fell love something would never change never grow old would always constant marys voice found could without would live without mary learned love rhage flaws also learned love return excellent',
 'amazing account highly recommend new view surviving childs memory reader feel hfear triumph love',
 'sad story gripping sad story end families war good insight details kommandant auschwitz captured wars end',
 'five stars thoroughly enjoyed book',
 'thriller nothing like many readers picked angels demons finishing da vinci code unfortunately two books similar made reading second one almost bore almostdespite blaring similarities da vinci code plot story suspenseful nonetheless turning pages anticipation like many good thrillers finished book knew resolution wish reading remember anything particularly striking book disappointing part dan brown uses exact plot format tdvc ad figured bad guy halfway bookthis great fun read nothing special even though similar feelings da vinci code found tdvc much interesting novel',
 'lacks multimedia kindle fire hdx good book problem multimedia didnt work kindle fire hdx much wish would',
 'titles book cooking pooh okay might seem immature writing thinking reason bought book title funny dont get think minute buying seem like good book children',
 'questions still although really enjoyed getting know beth wish understood could go normal woman something entirely different ok quickly went wondering wrath killer ok even love quickly felt fast fastwrath without doubt favorite part read book found interesting challenging understand like heroes description sounds wicked hot course keeps reading going blind pretty intense stuff shown mild limitation complete disability thought amazingi issues whole story line regarding mr x crew felt explored didnt care wish time cut half used explore wrath beth felt like didnt get see enough darius either looks change read series book   far get ready committedall really enjoyed book highly recommend',
 'five stars loved great characters story surprise ending',
 'three stars cannot comment book given daughter high school psychology sociology teacher',
 'awesome cant put book seriously treats question thats funny',
 'good story little fast paste good story characters loveable sweet love story recommended readers love romance',
 'well researched best lot work went book lacks objectivity proelvis though excessively seems entirely kind dr nick priscilla dr nick doubt curbed elviss overall drug consumption minimized careless peaks would occurred present however dr nick still party drug excess difficult believe interest primarily money expense caring elvis despite receiving hefty income found necessary borrow  volatile patient racquetball misadventure resulting legal estrangement still received better appreciation dr nicks effort reading book daunting task regards priscilla authors seem bought image trying project finstadts book priscilla presents plausible picture',
 'case reparations wow say tried something challenging time character shifts ties really pull end anyone doesnt like jealous writer',
 'nordic tale snow fell pleasant tale winter afternoon uncomplicated seemed quite honestly way young boy would feel life',
 'enjoyable read also learned lot enjoyable read also learned lot whitehouse administrations learned reading inside information general public never hears',
 'cant wait read sequels loved paranormal romance listened audio highly entertaining cant wait sequelscontent scene teenage sex way edit one scene would pass daughter',
 'helpful informative book diagnosed third stage kidney disease found book helpful technical went slowly understood would recommend one wanting understand disease try low protein diet',
 'plantation great summer read yanhif looking great summer beach book try plantation dorothea benton frank even better sullivans island great fun reading lowcountry tale family dysfunctional enough southern eccentric enough make laugh loud human enough make cry matriarch miss lavinia hoot torn cringing antics yet wanting like old age battles miss lavinias daughter mrs caroline wimbley levine perpetually pregnant lowlife sisterinlaw rage crown prince son drinks amidst gambling infidelity cast zany supporting characters ordinary story elements wealth marriages crisis parenting struggle independent balanced need go home however dorothea benton frank made characters come alive delightful way thomas wolfe proved wrong go home weeks searching perfect summer read finally found book chapter titled miss lavinia would like word read',
 'another good read mcdevitt little said simply good book enjoyable read nice plot line great character development come end simply happy came together',
 'number seven jack reacher number seven jack reacher another gripping well written book great story line typical reacher come resolve situation find love move hard put',
 'phenomenal book wow great story plot multi dimensional lots twists turns ending great cant wait next installment',
 'five stars nice',
 'conscientious parents please read review series appear marketed teens decided read book saw lot th th grade students reading books series story interesting however want caution parents  lot sex book people talking joking vampires people masturbating open multiple detailed love scenes main vampire human character alternate sexual relationships  also cursing  also pretty gruesome murders imagine students reading series parents didnt know elements story hope helps parents make wellinformed decision concerning appropriateness series child',
 'could keep interest got half way thru book became bored clich many books previously read',
 'five stars stephanie plum books funny',
 'four stars great wrap trilogy enjoyed three books',
 'compelling entertaining crime thrillers familiar sanfords prey virgil flowers books saturn run departure usual offerings science fiction story set near future  every bit compelling entertaining many crime thrillersthis realistic science fiction welldeveloped likable characters type story could maybe happen  although technical detail novel wellresearched characters politics race saturn outer spacescience fiction fans enjoy sanfords departure different genre nonscifi fans dont worry understanding technical jargon story good even dont know care nuclear reactor works loved hope brings back characters sequel',
 'five stars great read',
 'v funny difficult could lose virginity meanreally every tom dick harry readyheck downright eagerto help girl particular problem alas poor elliepoor untouched ellieseems cursed carry virgin shame adulthood isnt trying club hopping flirting outandout begging naught thus far time running ellie determined graduate college hymenfree along ridei loved virgin radhika sanghani wonderful story chock full situations humor girls relate appreciate ellie scream smart funny crying laughing hard times started listing theories see  v dilemma dominating existence fully invested quest get defloweredto nip virginity budso speakhailing southern united states unique set colloquialisms nice change pace read sometimes decipher brit speak peppered throughout storyif mood easytoread hilariously funny story things girls go try go wish hadnt gone virgin one shouldnt miss',
 'common vision love life forgiveness wonderful see many diverse artists expressing common vision love life forgiveness oneness commonality palpable may book help inspire many others follow artistic dreams live love forgive fully easily',
 'first time highlander series reader say hot hot hot read karen marie moning fever series loved series way different honestly doesnt even seem like author book fantasy nth degree cant say enough really enjoyed story line characters writing everything started kiss highlander know wont disappointed',
 'understanding larry legend twenty years written drive holds quite well definitive treatment larry bird told first personbut greatly helped journalistic skill bob ryanbird doesnt hide anything doesnt like something tells gets heart man wrong never forgets right hes loyal endthe early portion book deals poverty growing french lick yet bird saw blessed enjoying life older brothers heroes buddies grandmother stable influence quiet soul working multiple jobs mom courageous selfless glue heartbreaking section bird talks father fathers suicide bird the book picks birds time indiana state realizes good soon legend starts grow boston bird realizes play anyone remains french lick boston competitive drive relentless great',
 'great series books since nearly completed buying prey series one thing stands mr sandfords books including virgil flowers series adds dialog characters doesnt directly pertain main story talk two people subject matter incident focus overall plot great diversion often humorous line actually happens real life unlike fiction seems intent directed discussions crime investigation police characters abrupt opinions humor shoes one time life certainly understand themif market new series try different author john sandfords prey series great place indulge begin first book rules prey work way amazon listing lucas davenport prey series order isnt order pages displayed go wwwjohnsandfordcom get correct listing order',
 'book glimpse total complete internal destruction country people madman leading like b shirer correspondent berlin ww ii started privy workings german high command early days war united states involved met mingled officers german army access hitler giving speeches etc also traveled war zones stay including occupied paris book glimpse total complete internal destruction country people madman leading like blind sheep cliff also contains aspects shirers personal life mostly working life something longer available keen eyed reporter world events',
 'bad kudos ms stiefvater great book though obviously incredibly original plotline predictable places towards end nice twists writing solid descriptivemy complaint graces narrative seems aloof hard relate doesnt seem like much personalityi think youre going write romance novel give female characters traits anyone could relate make sure stand',
 'th grade ar th th grade discussion love book wellcrafted piece historical fiction makes reader witness whats happening town reader becomes judge jury hear multiple sides whats happening townthis book requires critical abstract thought every reader gets people dont speak literally every character good bad real life adults dont always speak plainlyits one favorite books teach written free verse lot white space pagethank ms hesses editors makes much easier students dyslexia reading disorders also available audio version vision dexterity issuesread book child discuss enjoy',
 'page turner hard time getting book got hurdle could hardly put',
 'eve first books still pretty good pretty good episode mystery series committee writers whoever actually writing much better job one although newer books series much thinner story first ones series obsession detective story one big inconsistency book description eves large diamond writer said roarke gave eve diamond first time said loved big error original writer series would known roarke told eve loved every day weeks would reply kind rejected gift diamond rather committing nearly broke good anyone actually written even read entire series would know errors like cannot convince writer wrote innocence death person wrote thankless death still read series like characters still aggravating voice stay consistent',
 'fun kids cute little story irish family celebrates st patricks day written style visit st nicholas twas night christmas clement c mooretwo children set leprechaun traps night st patricks day night morning found theyve caught one ask find hidden gold tricks wishes better luck next year super simple story bore young readers may exciting older kids might give ideas traps might like make younger sibling love siblings book treat usually holidays see kids fighting got best toy candy kids work together nicelywe also night easter author also cute dont really recommend getting treats probably better classroom setting multicultural multireligious backgrounds ideal use hoped anyway cute book fun story imo children age  could enjoy',
 'great always im always fan jd robb lt dallas book doesnt disappoint lots twist turns',
 'author deserves fame great american author although book published long recent discovery author well say one best books ive read  years thoroughly enjoyed epic required reading american school system',
 'review princess true story absolutely stunning storywe women west would never survived accepted programmed rules life women minguided muslim world freedom genders scarring deceit depicted novel belie authenticity female person sad sad sad',
 'insane goodness book amazing ordered separate one kids dont share mine relaxing meditative artist lineup top notch really better coloring book trust',
 'reads like novel ian tolls pacific crucible excellent work history book provides context events  makes story seem real almost report contemporary events toll incredible ability move action along manner novel exciting novel providing rich insights events people involved important players receive full eye fair portrayal example toll provides details churchills stay white house pearl harbor made vivid compelling story ever perceived learn participants virtues vices impacted war effort players discrete epidodes toll describes description action grippingif one reads one book beginning pacific war wellwritten entertaining us read many books war really enjoy',
 'enjoyed enjoyed book would tell sister friends book great romance book always love brenda jackson books makes feel book',
 'liked plot familiar almost quit reading thing held country traditions told end knew middle book predictable plot read',
 'three stars easy reading',
 'graphic style drawings coloring book graphic novel comic book style others seen backgrounds generally extremely simple probably appeal younger artists adults',
 'dolphins advocate foiled animal activists attempt free pair trained dolphins steve solomon finds morally obligated defend naive activist felony murder charge activists uncle da ray pincher conflicted calls former employee victoria lord prosecute solomon lord overcome professional personal conflict get truthall solomon vs lord novels show levines talent layering unfurling mystery yet adventure memorably different last satisfying returning fans winning new ones',
 'raritya welldone truecrime book kathryn casey ann rule texas shes veteran truecrime writer books readable wellresearched somehow missed one found pretty darn compelling lot legalese trial chapters thats standard major downside photos truecrime junkies must photos thanks ms casey',
 'right old big box store alley dont follow sanford third series character others lucas davenport prey series kidd pc jock artist con man knows genre well protagonist hippielookin kicker midwest uses aw shucks disarm suspects get bottom mysteries exactly guy keeps genre flirting danger disrespectful authority gets done brains little luck ar seven series  prey kidd   know check',
 'great book jazz musician mystery lover figured would right alley definitely moody writes flowing interesting style kept focused wanting put book aside got end recommend book highly look forward reading author',
 'roger best mostly chicagoan clearly remember days sneak previews wttw movies rest personally tended follow gene siskel rather roger ebert found eberts reviews occasion would include items really anything movie personal opinion enjoy tv contrasting reviews reviewers book said sections skipped early daysgrowing later toward end life first thought comments suspect midwest chicago area early days worth reading chapters toward endnot much yet fan either one worth reading probably amazing statement movie critic basically dropped lap another comment dont think really understood much influence popular culture really hadhey roger dont forget save aisle seat',
 'clsico es un excelente libro para aprender ejercitar la estrategia en ajedrez bien escrito ameno con comentarios pertinentes incluso con un toque de humor esperamos que pueda editarse en notacin algebraica',
 'interesting read read one back part longstanding series author stands alone novels seriesthe storyline simple enough figured story book ended still entertained would read books author',
 'love series like books one led care characters forgive one use county meant parish complaint assured rest book great',
 'must read book change life let',
 'five stars good reading',
 'unthinkable really shame fiction published masquerading truth disappointing wsj gave credibility distortion publishing organ donation real world respectful process wishes donor paramount respect donor donor family tremendous strict rules followed critical care community word harvest used harvesting done crops planted field specific purpose removing use elsewhere feel organs placed patients bodies purpose serving long full life tragic illness accident takes life away organs healthy enough transplanted save another life word recovery organs recovered donor transplanted something positive recovered loss want touched talk family person donated organs generally get different story portrayed sensationalist rag',
 'outstanding information learned much information never learned history classes excellent',
 'hard guess kept changing mind could done also liked poision conncection friend toxicologist',
 'five stars great book introducing different feeling experience facial expressions accompany',
 'dedicated reader griffen books dedicated reader griffen books one fastmoving continues saga osscia',
 'good read still reading book ehjoying',
 'big little lies great read liane incredibly good writer regarding thoughts emotions characters feel like know people reading books',
 'hangover  situation cant get worse sat plane giggled much alarm seatmates love farce',
 'best fiction ever read story katherine anchee min wonderful book browsing local library used work long ago stumbled upon read red azalea previously saw katherine becoming madame mao heart started race come looking books anchee min also picked becoming madame mao way new books display guess found wild gingergod bless anchee min brings america long expected true horrors red china characters evoke emotions amazing extent talking jasmine lion head specifically katherine truly excellent writerwell books ever written red azalea cant wait come nextthanks much hope enjoyed wonderful books much look biography review coming next',
 'brilliant masterful book highly recommended brilliant book jam packed information never seen books read reread effort absorb incorporate principles art rare masterpiece every serious artists library information composition simply stated yet mindblowing newly published books composition bought cant compete treasure rest book equally profound',
 'five stars beautifully writtenthis part one jared alys incredible story love doesnt fade heal great thing al jackson stingy words read one books getting full complete story incredible character development first al jackson read knew halfway hooked funny story started book friday night stayed entire night finish am finally finished problem second book way could adult without finishing told husband awful migraine stayed bed instead resting bought book  finished never life book captured like series knew loved read reading wrong books',
 'good book best author would still recommend like work would give  stars dont understand people say doesnt seem like julie garwood book good older books historical romances read better books genre yes still plot books series fbi agent younger girl smart beautiful plot revolves around fbi agent saving female girl gets shot atexplosionsetc said still fun like said liked ones better still liked book like author',
 'must read russian far east book brought back wonderful memories wonderful friendly people living spite harsh climate',
 'exciting thriller spellbound plot suspense believability book one best ive read year want read series',
 'great thinking love pleasure reading reading one books quoted book dante alighieri thought would try love thinking invlved book amazing way dig deep relate relate nice able read literature share',
 'excellent story treat even second time around ms delacroix captured battles banter husband wife deftly',
 'another home runtouchdown burton another home runtouchdown burton wow book hot hot hot hot budding relationship trevor amazing watch easy read secret finally comes amazing love serious issue woven book loved games trevor played weather baseball football described made want watch games right along reading want meet characters someone could put jaci burton play play book could love become friends wives girlfriends teammates well would oh happy grief felt intense took time finally work love trevor stepped right helped perfect man tough loving honest yes even secret sexy talentedjust perfect',
 'toughest toughest navy seals considered toughest combat troops us forces corpsmen toughest toughest story young mans search significance adventure played crucible vietnam young men like mr mcpartlin americas greatest asset men modern age cynicism distrust nothing short amazing legendary col john boyd fond reminding us machines dont fight wars people use minds upon reading book one would add toughest use heartin book mr mcpartlin takes us heart warrior code along way applies code passion battle passion lives fellow warriors well enemy combatants finds care epitomy american warrior hero mr mcpartlin represents kind american warrior wish could courageous tough yet without loss compassion toughest tough willing care face hopeless impossible situations start journey mr mcpartlin want carry end',
 'four stars easy read ending left things hanging like loose ends tied without waiting f',
 'good read predictable good read predictable yes situations nevertheless easy reading good right prevail',
 'liked book enough buy sequels well written liked book enough buy sequels well written',
 'somewhere safe like letter father tim newest friend reader  years since first mitford novel published jan karon developed perfect recipe delicious storytelling ive enjoyed story whenever need minivacation stories always warm uplifting enough suspense hook fans cant help become invested many colorful multigenerational characters  footed  footed father tims love mention love gardening love food somewhere safe like letter father tim newest friend reader adventure picks last adventure ended always good hear father timi issue editing felt constant reminders characters review previous stories bit cumbersome maybe would best addressed new fan first single chapter preface feel characters beloved written stories cumbersome story try refer longer part main story referring deceased characters introduction new characters diluted overall storyline making difficult follow also story written perspective different characters father tim written said found difficulty following along',
 'dont read even love nicolas sparks disappointing read book like nicholas sparks book didnt anything keep attention kept reading hopes would get better like books nicholas sparks unlike wedding emotion evoked book except disgust behavior female lead lexie close end story everyday life engaged couple barely knew miscommunications towards end crisis seemed contrived read wanted find happened cared characters point waste free time',
 'velma jean wonderful character hope author writes sequel velma jean wonderful character',
 'new rear window liked book surprises twists plot good summer read',
 'excellent text brilliantly composed immense value therapy student even book required reading course study read anyway trust got many psyc texts',
 'liked lot interesting reading liked lot',
 'best gray man yet love action book full pure entertainment escapism best looking thought provoking gentle reading look way book face nonstop action bad guys love hate hero superhuman plot side belieavable great read book disappoint cant wait next',
 'great books great set books granddaughter loves reading lets know finished needs',
 'five stars well done',
 'four stars good read however surprises writing liked interaction three women',
 'predictable reading first three books series hoping dan brown would find different way keep us amused edified hasnt book predictable choc full false suspense',
 'quick read quick read thoroughly enjoy pick sunday afternoon read good flow story',
 'jr ward like others awaited chapter black dagger brotherhood great anticipation finally see happens character jr ward introduced first book bringing together episodes three novels series far wild ridejr wards writing fast paced action furious hot adept writing intimate interactions characters form glue holds plots subplots togethershe developed wonderful mythos around characters unique writing adventure romance actionfor like romance real edge series youdark lover black dagger brotherhood book lover eternal black dagger brotherhood book lover awakened black dagger brotherhood book ',
 'great book great book lot info regarding emotional development children also lot tips parents teachers really enjoyed reading book',
 'many questions answered wonderful trilogy answers many questions stokers novel questions  females tormented harker transylvania abraham van helsing obsessed destruction dracula hadnt peasants destroyed years glorious prequel kept style tone stokers deeply enriched dracula legend',
 'attack one night stand book truly funny heroine dumped goes trip africa hilarious friends meets drop dead gorgeous ranger spends night withokay fish water story ranger gets ny drives everyone batty hes element plots little neatly tied way storys told authors humor makes terific read big romance thats youre looking laughed loud read felt characters memorable author fav list',
 'like others kay scarpetta case searching party responsible multiple murders time appears lovers turning dead daughter newest drug czar goes missing along boyfriend feared turn murdered seriel killer like others story several twists turns including conspiracy cover theory decent enjoyable read',
 'awesome book awesome friends laugh thirty chapters books highly recommend jimmy age ',
 'ice limit intereting book good got feeling going turn series dont think would another book subject',
 'five stars every thing advertised',
 'didnt pull like others okay took awhile finish didnt pull like others',
 'five stars love stone barrington books quick reading captivating',
 'book practically reads iself cant wait read complete trilogy shaman find happens next great writing',
 'nice effort good plot twists liked book especially read afterword think neat named many characters reallife friends familyher plot enjoyable variety twists turns believable others characters welldeveloped liked way drew background small chunks believable ways long periods introspection seemed contrived long chunks info dump either boring distracting bothall really enjoyed likely read follow books dd warren future',
 'discovering new protagonist wonderful read first experience special agent pendergast much done plot line although enough twists story keep edge rather brought early antagonists wonder end pendergast nice cross holmes jack reacher plan jump right second book helen series back track earlier stories',
 'intelligent thought provoking relevant books could good submission well written tightly paced novel completely engages reader important addition growing body literature explores effects  reveals tragedy continues affect american psyche waldman leaves stone unturned creating diverse realistic cast characters represent current american scene bounty discussable topics include religion politics art make ideal book groups one favorites year novel take great pleasure recommending',
 'two stars poorest john sanford book read read',
 'magverse please glad read another story magverse series would love read world created two stories also good always look forward read books author auto buy',
 'good read really like whole series',
 'speaking truth power bit outofdate days im still glad read book im glad al franken good recognizing remembering bullies hope hell able write future',
 'tough going found book tedious slog commend finishing live expectations',
 'great read friends tried months get read paranormal romance avail finally tried couple books still unconvinced one great read hard time putting cant wait read next series highly recommended',
 'insight third branch government toobins book provides insight supreme court third branch government readers learn justices work involves simply literally interpreting laws constitution conservative backlash built since roe v wade agenda reverse roe v wade expand executive power end racial preferences speed executions welcome religion public sphere eg prayer schoolstoobin convinced abortion rights central issue todays court two kinds cases abortion rights others certainly topic become litmus test considered justices nine takes us several important decisions bush v gore efforts refinetrim roe v wade university michigans approach giving minority preferences etc well providing background justices given oconnor became swing vote many cases presentations often shaped specifically appeal herfinally another topic given close review nine thinking went selecting sometimes rejecting various nominees seat court',
 'one tough sob agrivating female cohort jumping world chasing rag heads presumably deadly wmd back stabbing official washington dc everything bring credit including murder pace action realistic story rough needs polishing',
 'five stars great',
 'easy read nice easy read wasnt expecting epic saga causal read pass time great',
 'great book book classic laurell hamilton ride roller coaster emotions suppressing monsters good plot line im sure recycling past character always good thing works book looking forward commitment ceremony',
 'give drood every book series fun exciting well written creative great characters fantastic premise devoured day love greens universes slowly becoming interconnected great ending cant wait next one audiobook',
 'depressing dragon drama story happy joy filled romance lead male slowly going blind childhood infection carries lot emotional baggage lead female falls love father forces marry threatening employment knowledge fathers machinations find pregnant never wanted children couple married duress baby way course assistant wants dragyoudown dramabelievable characters depressing dialogue story linei reread story look forward works author',
 'fast pace good read difficult put',
 'book thief first impressed book read liked hard put book',
 'great buy niece always took care technical problems phone computer well went college proud times ive run question accomplish step iphone book walked solution niece came home visit impressed knowledgeable auntie become ps mac dummies equally good',
 'one favorite books since ive read book yet keep remembering scenes characters ive purchased several copies years give friends',
 'worth time book mitchells first novel said cohesive tightly written book series linked first person narratives told nine characters nine characters range terrorist cult member favoriteto late night deejay various nationalities various stages life short inhabit global village characters remarkably rendered intertwined experiences unique yet interrelated final chapter brings shared fate challenging work obviosly book every reader reviewers compare author favorably delillo murakami struggle wdelillo love murakami dont see connection especially murakami reference points global literary village comparisons helpful look forward next effort',
 'fantastic effort simply superb writing haunting mesmerising yet touching downloaded next hollinghurst kindle',
 'cup urban fantasy romance selected book based amazon recommendations supposed urban fantasy faeries strong female protagonisti got bad vibe beginning book main character described enormous breasts didnt much improve setting actually quite interesting faerie otherworld shown parallel world reachable newly activated portals creatures cohabitating world vampires things popped back ms galenorn takes pretty much mythos get hands finnish japanese celtic tosses pot mixes well kind like aspect book despite overthetop fashion itwhat completely ruins book two things first really needed harlequin brand second protagonist insufferable keeps contradicting arrogant bossy hypocritical thing acts animal instinct gets upset people call could kind endearing makes dislike intently knowing shes sexy dont forget except things describes actually sound attractive',
 'entertaining educational american author debra olliver sat kitchen table mentioned lived france ten years french husband exuberance asked tell french women coffee french women know would transcript resulting conversation lengthy conversational essay olliver expounds every personal anecdote pop culture reference quote muster relating french womana woman boasts internationally acclaimed allurewhat american woman learn  pages addressing secrets french approach life love definitely sex youd like know first must concede average french woman simply sexier self assured sophisticated average american woman accept premise smartly explores french culture produces distinct woman woman captivates world french women know enjoyable read book ive read portray edith wharton right catty',
 'reapers stand joanna wylde enjoyed books serieshowever reapers stand favoritei absolutely loved picnic book alleverything need keep late night cant put could id give  stars would',
 'great time waster keep busy long time metitative great time waster',
 'tom f mike excellent job recent demise american across many landscapes based upon fact opinion book great job tracing happened hopefully done get us back course meantime hedge bets start learning mandarin',
 'exclusive riding camp december selection womens book club highly touted national publications disappointed book usually fall love one characters didnt connect characters story',
 'abomination first novel read author last concept interesting lots people experience meddling homeowners associations execution dreadful horror satire doesnt work awful awful stuff main characters unbelievably foolish novel completely contrived works level avoid costs',
 'bring jasmine toguchi summer loved character fun fall love character know three books coming short order jasmine funny interested world great young readers graduating picture books fun family read aloud totally recommend',
 'excellent reading read centenial many many years ago decided read one favorite books micheners would recommend anyone wanting experience throught reading settling american west',
 'joe rides one best especially two areas action unerring sense place joe faced seems like pretty straightforward murder time boxs evil fed epa guy lengths guy go get supposed murderer may seem farfetched paranoid boxs readers eat loved action nearly saddle sores time finished could put one mr box please',
 'best half way dont know finishshe favorite writer one hold interest',
 'beautiful novel found book recommended amazon based upon previous books purchased usually find liking books amazon flags books would like time set across time novel beautiful hopeful tragic sad time instantly fell love characters cared young girls search love family moved old man best friend downstairs read shadow wind familiar books book within book format bit confusing novel falls bit short  starts feel ending rushed author wasnt sure exactly wanted therefore ended came across bit muddled beautiful novel stay bookshelf intend circulate amongst friends family',
 'another good john corey story expected demille good job putting john corey tight situation story interesting thought ended rather abruptly hope corey goes back cop hes best',
 'five stars excellent service excellent product described thank',
 'promises death crazy book many twists turns think know throw curve',
 'four stars earlier books great',
 'fantastic book wonderful book readers young adults surprising full historical detail cannot recommend highly enoughalison c vesely artistic directorfirst folio theatre',
 'nice see continued development jack ryan jr nice see continued development jack ryan jr related characters developed noted time future world events',
 'fast moving easy read interesting story encompassing many social issues things figured author got overall engrossed see worked liked',
 'interesting interesting account civil war years greatly detailed informative',
 'steve jobs awesome book loved reading loved love reading computers great book',
 'fun color book fun color would recommend book',
 'easy read learn book makes wine approachable much easier understand super engaging easy read grateful author fantastic resource',
 'winning debut novel drums opens meloncrunching tada drum enthusiast sam whallops loudmouth danny head marimba mallet easy girl lives breathes drums middle school especially stressed parents arent supporting shes told band friend shes really good sam decides risk even trouble borrows family lawnmower earn extra income pay private drum lessons neighbor drum guru pete without asking parents tutilege sam progresses fast pace even homelife disintergrates school misdemenors finally catch shes brink something big shes suddenly facing life without drums without courage fortitude sam doesnt stand chance continuing hear musicthe author outstanding job creating believable middle school environment characters sam main character easy root drums touching contemporary read grades ',
 'towering inferno feel book good read thrilling di vinci code angles demons underlying story line one presents wealth questioning feel intense quality others could also tainted fact two made splashy moviesi enjoy traveling florence venice book two favorite cities visitwould recommend yes',
 'mattie enjoyed book would liked better married life went older version',
 'among best socal mysteries southern california mystery writer first book initial release remain greatly impressed devil blue dress walter mosley one major talents crime fiction field debut mystery terrific book easy rawlings landmark character mosley handles postwar los angeles setting better author recall book could written raymond chandler focused africanamerican protagonist book deals california themes well ross macdonald best devil blue dress contemporary classic',
 'want read good book get great writer want read good book get great writer',
 'ahh romance another wonderful love story favorite author takes place favorite texas town promise love following people lives loves',
 'follow evidence dont doze might lose track whos story builds courtroom shootout like plots past surfaces twist events present left wanting character depth',
 'empty easy read enjoyable keeps toes total chick book fast pace attention grabbing adventure challenge requisite romance thrown',
 'wonderful comprehensive book homeschooling reading dozens books homeschooling honestly say best one found market homeschooling four children ages     first time year previously rigorous collegeprep private school looking way give similar education home book wellorganized inspiring helpful agree presented book although would everything authors didwith large family make changes new baby due january find able feel important book given us courage make huge change lives',
 'incredible far best book series questions answered ways never could expected loved',
 'good rad strong heroine desire freedom despite oppressive father society hero admits weaknesses good balogh read',
 'jim dale delivers usual incredible talent harry potter deathly hallows outstanding book jim dales meticulous reading makes story better wont regret listening book jims ability create voices character maintain consistently throughout book amazing voices chooses character help picture characters ive never moment felt representing character way wasnt consistent story fantastic',
 'wow gotta keep mind going many layers story cant let go stop reading lose track fascinating absolutely love shouldnt problems making movies',
 'cowboy romance predictable lovely ride love books characters woven throughout previous books like return old friends good read',
 'favorite final book chesapeake series well worth wait wish nora roberts could keep series going forever ever seeing seth grown amazing get  books read order want start soon finished',
 'five stars expected',
 'another good one didnt get  stars entirely long continuation risas stupidity came lucian finally eyes fully opened took way long considering suspicions even liking train thought came realize lucian likely put compulsion spell couldnt resist likened essentially using date rape drug excusing insult women azriel needs step plate love even though frustrated risa whole oh doom get together bs keeps spouting tiresome cant wait next installment time key searched better lose one',
 'inspirational teens denmark occupied nazis wwii citizens seemed roll allow take everything knud petersen friends however would stand started small group saboteurs city odense continued aalborg family moved except brother also helped others family knew day betrayed arrested knud brother served longest sentences arrested conspirators survived prison time serving time others country finally picked torch harassing germans boys courage inspiring shaming actionphilip hoose inspired tell story read small exhibit resistance museum copenhagen time knud till alive could hear entire story person exchange s email questions answers great look power teens inspire entire countryms hs essential',
 'good stuff good book smart guy',
 'great loved book great wait series writer really brings book greatly written',
 'five stars great story line',
 'tough times us bomber crews europe highly capable engaging writer robert mrazek really done homework regarding costly unsuccessful usaaf raid stuttgart germany sept   rather early us bombing campaign europe us air chiefs like combat fliers still learning ropes mission examined book one several showed starkly dreadful cost sending bomber formations deeppenetration raids especially germany without fighter escort mrazek follows six b crewmen officers enlisted preparation execution survived aftermath raid riveting superbly told story tragically underacknowledged mission interesting bonus epilogue telling became survivors later years',
 'different rest one felt little different rest series first one read series think bit mistake time idea series related book',
 ...]

Map our text to word tonkenizer

In [ ]:
map(word_tokenize, texts)
Out[ ]:
<map at 0x7f0fe75890d0>

Create a new column called POS_text.

We will save our result into this column.

In [ ]:
model_data_raw['POS_text'] = pos_tag_sents( model_data_raw['text'].apply(word_tokenize).tolist())

Now, we delete all useless columns we aleady know from Q3.

In [ ]:
model_data_raw = model_data_raw.drop(['style', 'token_text'],axis=1)
In [ ]:
model_data_raw.head()
Out[ ]:
verified text score POS_text
744031 1 sigh miss claire jamie vague characters wasted... 1 [(sigh, JJ), (miss, NN), (claire, NN), (jamie,...
801184 1 remarkable book many levels thoroughly enjoyed... 5 [(remarkable, JJ), (book, NN), (many, JJ), (le...
341256 0 buy book laurel hardy fans love book certainly... 5 [(buy, VB), (book, NN), (laurel, NN), (hardy, ...
734969 0 enthralling suspense readers familiar ms coult... 4 [(enthralling, VBG), (suspense, NN), (readers,...
750145 0 great story fantastic illustrations cupidandps... 5 [(great, JJ), (story, NN), (fantastic, JJ), (i...
In [ ]:
model_data_raw.to_csv('Pos_dataset.csv')
In [ ]:
joblib.dump(model_data_raw, 'pos_data_raw.pkl')
Out[ ]:
['pos_data_raw.pkl']

4.1.3 Get TF-IDF weighted vector

Now it's time to extract the nouns only words to obtain a bag-of-words tf-idf weighted vector.

First, we create a new column called "only_noun". We will get all NN type of words from the column POS_text and save those into our new column.

In [ ]:
# reload the dataset for more easy to access
model_data_raw = pd.read_csv('/content/drive/MyDrive/A3/Pos_dataset.csv')
model_data_raw = model_data_raw.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
model_data_raw = model_data_raw.convert_dtypes()
model_data_raw = model_data_raw.fillna('empty')
model_data_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   verified  100004 non-null  Int64 
 1   text      100004 non-null  string
 2   score     100004 non-null  Int64 
 3   POS_text  100004 non-null  string
dtypes: Int64(2), string(2)
memory usage: 3.2 MB
In [ ]:
# create a new column called only_noun with empty values
model_data_raw['only_noun'] = pd.NA
In [ ]:
model_data_raw.head()
Out[ ]:
verified text score POS_text only_noun
0 1 sigh miss claire jamie vague characters wasted... 1 [('sigh', 'JJ'), ('miss', 'NN'), ('claire', 'N... <NA>
1 1 remarkable book many levels thoroughly enjoyed... 5 [('remarkable', 'JJ'), ('book', 'NN'), ('many'... <NA>
2 0 buy book laurel hardy fans love book certainly... 5 [('buy', 'VB'), ('book', 'NN'), ('laurel', 'NN... <NA>
3 0 enthralling suspense readers familiar ms coult... 4 [('enthralling', 'VBG'), ('suspense', 'NN'), (... <NA>
4 0 great story fantastic illustrations cupidandps... 5 [('great', 'JJ'), ('story', 'NN'), ('fantastic... <NA>

Note: We know that feature POS_text has index 3 and only_noun has index 4. We will use this information for the next step operation.

Now, we write a for loop to iterate all instances and get the Noun by rows then save it to only_noun feature.

In [ ]:
# use for loop to get all pos_list
from tqdm.notebook import tqdm
for i in tqdm(range(len(model_data_raw))):
  # get one instances from POS_text by iloc with i, 3
  text_list = model_data_raw.iloc[i, 3 ]
  # get a empty set for saving noun words
  nouns = []
  # iterate over the text_list
  for words, wtype in text_list:
    if wtype == 'NN': # if it's a noun
      nouns.append(words) # add words to our noun array
  
  # end for
  # we save our nouns extraction to our new column 
  # first we convert our list to string then save the string back to the dataset.
  str1 = " "
  nouns_str = str1.join(nouns)
  model_data_raw.iloc[i, 4 ] = nouns_str

Save our results to a new dataframe

In [ ]:
data_only_noun = model_data_raw.copy()
In [ ]:
data_only_noun.to_csv('data_only_noun_data.csv')

We reset our new dataset's index.

In [ ]:
data_only_noun = data_only_noun.reset_index()
data_only_noun = data_only_noun.drop('index',axis=1)

Print the new dataframe's head

In [ ]:
data_only_noun.head()
Out[ ]:
verified text score POS_text only_noun
0 1 sigh miss claire jamie vague characters wasted... 1 [(sigh, JJ), (miss, NN), (claire, NN), (jamie,... miss claire jamie book summer holiday dollar s...
1 1 remarkable book many levels thoroughly enjoyed... 5 [(remarkable, JJ), (book, NN), (many, JJ), (le... book book boy title year catalyst book obsessi...
2 0 buy book laurel hardy fans love book certainly... 5 [(buy, VB), (book, NN), (laurel, NN), (hardy, ... book laurel book affection john mccabe ollie p...
3 0 enthralling suspense readers familiar ms coult... 4 [(enthralling, VBG), (suspense, NN), (readers,... suspense couple sherlock husband wife team rea...
4 0 great story fantastic illustrations cupidandps... 5 [(great, JJ), (story, NN), (fantastic, JJ), (i... story story heroine j lynchs watercolor book t...

Now we can see in only_noun column, there indeed only nouns.

It's time to get the TF-IDF matrix

In [ ]:
#----------------TDIDF_Data_generator---------------
def TDIDF_Data_generator_pos(data, max_features = 500, feature_name='only_noun'):
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.feature_extraction.text import TfidfTransformer
  # we set for max_features: as most words we saved for 500 words
  # in order to not exceed our memory
  v_test = TfidfVectorizer(stop_words='english', max_features = max_features)
  # get TDIDF array from token_text
  x_token_text = v_test.fit_transform(data[feature_name])
  # save it into a pandas dataframe
  tdidf_data = pd.DataFrame(x_token_text.toarray(), columns = v_test.get_feature_names() )

  data_copy = data.copy()
  data_copy = data_copy.reset_index() # need to reset index for matching the results

  tdidf_data['verified'] = data_copy['verified']
  tdidf_data['score'] = data_copy['score']
  return tdidf_data
In [ ]:
data_only_noun = pd.read_csv('/content/drive/MyDrive/A3/data_only_noun_data.csv')
data_only_noun = data_only_noun.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
data_only_noun = data_only_noun.convert_dtypes()
data_only_noun = data_only_noun.fillna('0')
data_only_noun.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   verified   100004 non-null  Int64 
 1   text       100004 non-null  string
 2   score      100004 non-null  Int64 
 3   POS_text   100004 non-null  string
 4   only_noun  100004 non-null  string
dtypes: Int64(2), string(3)
memory usage: 4.0 MB
In [ ]:
only_noun_tdidf_data = TDIDF_Data_generator_pos(data_only_noun)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
In [ ]:
only_noun_tdidf_data.head()
Out[ ]:
ability account action addition admit adult adventure advice age air ... wow writer writing year york youll youre youve verified score
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 1 1
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.499273 0.0 0.0 0.0 0.0 1 5
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0 5
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0 4
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0 5

5 rows Ă— 502 columns

In [ ]:
only_noun_tdidf_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Columns: 502 entries, ability to score
dtypes: Int64(2), float64(500)
memory usage: 383.2 MB

4.2. Repeat question Q3

Firse we get our training, validation, and test set

In [ ]:
X_raw, y_raw = get_model_set(only_noun_tdidf_data)
X_train,X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)
X_train,X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

4.2.1 Feature selection

Then we reuse our feature selection function from Q3.

Save feature names.

In [ ]:
# save feature names
features_name = X_train.loc[:,:].columns.tolist();
len(features_name)
Out[ ]:
501
In [ ]:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

fscores = select_features_prompt(X_train, y_train,X_test, f_classif)
Feature 0  ability: 1.484523353924719
Feature 1  account: 1.0113771497221016
Feature 2  action: 11.783286846512594
Feature 3  addition: 4.746295143419053
Feature 4  admit: 6.854024986883262
Feature 5  adult: 1.110567952059355
Feature 6  adventure: 10.688372847999839
Feature 7  advice: 5.133501014113466
Feature 8  age: 0.1570575401148864
Feature 9  air: 1.6799802266745127
Feature 10  amazon: 5.237110090455931
Feature 11  analysis: 1.120373376362189
Feature 12  approach: 1.82041230004955
Feature 13  area: 2.641287618212717
Feature 14  art: 0.45349177589444506
Feature 15  attention: 4.7586168974236625
Feature 16  audience: 15.40774957570686
Feature 17  author: 20.67817862386069
Feature 18  baby: 0.5375785989191142
Feature 19  background: 5.443685813549299
Feature 20  battle: 3.4514230327406876
Feature 21  beach: 17.901472759527405
Feature 22  beauty: 1.1696797004329305
Feature 23  biography: 1.3217624398142493
Feature 24  bit: 147.58788356965488
Feature 25  blood: 1.8095629306087382
Feature 26  board: 0.8672361723967394
Feature 27  body: 0.710524616866586
Feature 28  book: 33.16835879004613
Feature 29  box: 0.9618562607707842
Feature 30  boy: 0.17379465984610287
Feature 31  brain: 3.5160447848478387
Feature 32  break: 1.2717468903925608
Feature 33  brother: 2.0734313806985623
Feature 34  building: 3.4169463131885562
Feature 35  business: 0.5233560392685315
Feature 36  buy: 7.104682042060348
Feature 37  buying: 7.851504842316437
Feature 38  car: 2.436781586950014
Feature 39  care: 87.1748665327729
Feature 40  career: 1.1710427560321748
Feature 41  case: 20.9497061789894
Feature 42  cast: 0.7211708987770752
Feature 43  cat: 0.9647324113030373
Feature 44  century: 3.6015624365333485
Feature 45  chance: 2.496032655227362
Feature 46  change: 1.9367379297385838
Feature 47  chapter: 25.594336068050445
Feature 48  character: 89.43345373804378
Feature 49  check: 2.1736083566646633
Feature 50  chemistry: 1.9736279825922531
Feature 51  child: 1.3958569623061379
Feature 52  childhood: 1.671971014702444
Feature 53  choice: 1.7014701229281297
Feature 54  christmas: 4.815861975883245
Feature 55  church: 1.7799495886836865
Feature 56  city: 4.765346609430831
Feature 57  clancy: 1.1811591657657707
Feature 58  class: 0.5887726986968728
Feature 59  clever: 0.4483815953893954
Feature 60  club: 7.8621927131582385
Feature 61  collection: 1.1438300068731533
Feature 62  college: 2.8707834944162998
Feature 63  color: 1.649055075450117
Feature 64  come: 1.9349169810880378
Feature 65  community: 0.44632336721367283
Feature 66  company: 0.5925140224590132
Feature 67  complaint: 9.147308878655876
Feature 68  concept: 11.15035499100627
Feature 69  conclusion: 13.309815058739572
Feature 70  condition: 13.574800019129677
Feature 71  connection: 6.401496212102634
Feature 72  content: 16.434381361385316
Feature 73  continuation: 3.2631594746434747
Feature 74  control: 2.158794019331104
Feature 75  cook: 1.6746816314038961
Feature 76  cookbook: 2.98505363892507
Feature 77  copy: 4.43913555648253
Feature 78  country: 3.004912002058977
Feature 79  couple: 9.188036252531427
Feature 80  course: 2.7563155082423156
Feature 81  cover: 1.9638872939893832
Feature 82  cozy: 4.551551503863827
Feature 83  crime: 3.0513939950239877
Feature 84  culture: 2.4828077146724516
Feature 85  cussler: 2.2502069876297477
Feature 86  cute: 3.0494126811898346
Feature 87  dan: 2.5895535034560346
Feature 88  danger: 0.7661118985988636
Feature 89  dark: 6.3075948801837995
Feature 90  date: 1.9919491245401824
Feature 91  daughter: 6.965705872600137
Feature 92  davenport: 0.5339527791080856
Feature 93  david: 1.9720955581101944
Feature 94  day: 4.49779431640777
Feature 95  deal: 1.9141609483727822
Feature 96  death: 0.8722276930476675
Feature 97  depth: 11.997947476015398
Feature 98  description: 19.430782988035734
Feature 99  development: 15.045857281400433
Feature 100  dialogue: 39.06520005645369
Feature 101  didnt: 201.84269110610134
Feature 102  difference: 0.3707254913754456
Feature 103  disappoint: 15.153438372803732
Feature 104  disappointment: 189.60253635412613
Feature 105  discussion: 2.897260690667802
Feature 106  doctor: 0.49408733061986804
Feature 107  doesnt: 37.49934960452486
Feature 108  dog: 0.6529804403607607
Feature 109  dont: 98.36842544050137
Feature 110  doubt: 7.900660288307749
Feature 111  dr: 2.232440801293146
Feature 112  drama: 0.8058618479130443
Feature 113  dream: 0.284595701205597
Feature 114  earth: 0.6787919130552243
Feature 115  edge: 3.5822315352446252
Feature 116  edition: 7.777215927440611
Feature 117  editor: 55.70788384026035
Feature 118  education: 1.4571185940162903
Feature 119  effort: 40.83816883855241
Feature 120  emotion: 3.6474937095569464
Feature 121  end: 11.131608565203212
Feature 122  enjoy: 11.188934448083158
Feature 123  entertainment: 3.6511562363092325
Feature 124  era: 1.8281503546296984
Feature 125  escape: 2.079007168367492
Feature 126  event: 1.9213780557784068
Feature 127  example: 14.534062463699835
Feature 128  excellent: 15.643466633202488
Feature 129  exception: 2.0498759488609113
Feature 130  exchange: 9.149399657515463
Feature 131  experience: 3.166429137870083
Feature 132  explanation: 5.727862000668472
Feature 133  eye: 3.532584022258212
Feature 134  face: 0.4263672566869334
Feature 135  fact: 13.437951413857407
Feature 136  faith: 1.749040686390349
Feature 137  fall: 0.4956699073764301
Feature 138  family: 19.170498510334212
Feature 139  fan: 11.722613707348524
Feature 140  fantasy: 0.43605778665636974
Feature 141  fashion: 2.339373207907095
Feature 142  father: 0.2136811188959289
Feature 143  favorite: 3.781996810410293
Feature 144  feel: 11.300906943827485
Feature 145  feeling: 2.9724175605965257
Feature 146  fiction: 3.702962541267234
Feature 147  field: 2.2128977636633254
Feature 148  figure: 13.005891401902185
Feature 149  film: 0.6548528905262022
Feature 150  finish: 50.492300181830146
Feature 151  focus: 19.314333381461438
Feature 152  food: 2.3647364431348614
Feature 153  force: 2.4754375601316525
Feature 154  form: 6.31509561997899
Feature 155  friend: 2.1867167881497434
Feature 156  friendship: 1.8212652166082175
Feature 157  fun: 40.04761411309636
Feature 158  future: 1.8152320505797896
Feature 159  game: 1.5538834976654152
Feature 160  genre: 2.898447414745322
Feature 161  george: 0.09708401705803704
Feature 162  ghost: 1.6875233631946576
Feature 163  gideon: 1.8682169512856495
Feature 164  gift: 17.596031049670923
Feature 165  girl: 9.60756104239433
Feature 166  glad: 1.9088357648968661
Feature 167  god: 2.1383440047868225
Feature 168  gold: 0.14959952873175672
Feature 169  government: 0.8997508475110094
Feature 170  grace: 1.0760014338702186
Feature 171  grade: 1.7560243273795417
Feature 172  granddaughter: 11.215489882130775
Feature 173  grandson: 9.139463197262403
Feature 174  grisham: 2.975947954037313
Feature 175  group: 3.9491078202986634
Feature 176  growth: 1.3028917320058722
Feature 177  guess: 60.25009929400542
Feature 178  guide: 0.7124451787189396
Feature 179  guy: 16.136536717460704
Feature 180  half: 75.3431205932986
Feature 181  hand: 2.818615885879935
Feature 182  harris: 1.6800704891592069
Feature 183  harry: 5.872877927410753
Feature 184  hate: 17.949336057787853
Feature 185  havent: 1.2143568772765128
Feature 186  head: 1.6721105566003271
Feature 187  health: 0.5335304486420903
Feature 188  heard: 0.5493108389486383
Feature 189  heart: 19.24025521825326
Feature 190  hell: 1.1939123673023369
Feature 191  help: 0.3774529602308114
Feature 192  hero: 11.70868913685098
Feature 193  heroine: 25.956566421041586
Feature 194  history: 6.605481139311091
Feature 195  hit: 6.342066183391154
Feature 196  home: 3.1370783919803116
Feature 197  honest: 2.255752188905189
Feature 198  hope: 4.966973040207056
Feature 199  horror: 0.956190764631228
Feature 200  house: 3.4454545932071956
Feature 201  humor: 1.6181502345999637
Feature 202  hunter: 0.28703027351413535
Feature 203  husband: 0.8306981737592558
Feature 204  id: 9.988228781328191
Feature 205  idea: 54.49616661724381
Feature 206  ill: 5.620159080417381
Feature 207  im: 14.569552980543072
Feature 208  imagination: 1.0472986123040564
Feature 209  imagine: 3.008196469814513
Feature 210  information: 4.653596801188068
Feature 211  insight: 1.8881840563283456
Feature 212  installment: 3.436213034615245
Feature 213  interesting: 10.461627502756409
Feature 214  intrigue: 2.9960106805358633
Feature 215  introduction: 3.2867903316936147
Feature 216  island: 0.7824594915078237
Feature 217  isnt: 27.41365150468441
Feature 218  issue: 12.380160983055172
Feature 219  ive: 1.4194272734935054
Feature 220  jack: 0.7917412238296195
Feature 221  jackson: 3.1316052937209373
Feature 222  jane: 0.9142741797296041
Feature 223  jesus: 3.512687920778627
Feature 224  job: 16.461462653195717
Feature 225  joe: 4.950334881054976
Feature 226  john: 0.1749571891968384
Feature 227  journey: 5.238448209122317
Feature 228  joy: 6.409718219987695
Feature 229  justice: 0.5163503781777901
Feature 230  kid: 0.05735559217851594
Feature 231  killer: 5.165007462567746
Feature 232  kind: 54.916785804928686
Feature 233  kindle: 3.0025874138956303
Feature 234  knowledge: 1.0566774495401527
Feature 235  lack: 27.62878191248375
Feature 236  lady: 1.076864586850115
Feature 237  land: 0.298221810971217
Feature 238  language: 15.776569543312972
Feature 239  law: 0.7680231426833487
Feature 240  lawyer: 5.051295695432925
Feature 241  lead: 3.019627774387962
Feature 242  learn: 1.4760251097460277
Feature 243  lee: 0.8208707351302562
Feature 244  lesson: 1.3931114246486482
Feature 245  let: 7.598915462391124
Feature 246  level: 7.301407766120053
Feature 247  library: 3.5086402143037585
Feature 248  life: 22.251712463774304
Feature 249  light: 12.054325247142252
Feature 250  line: 17.242417771094548
Feature 251  list: 2.2836754373003916
Feature 252  literature: 0.8600987834022014
Feature 253  living: 2.7123772115020817
Feature 254  look: 4.205178024990026
Feature 255  lord: 0.31719576729442694
Feature 256  loss: 1.938900581418351
Feature 257  lot: 43.375487351655515
Feature 258  love: 100.56963012420505
Feature 259  lover: 0.7806211086037842
Feature 260  magic: 2.069978799544711
Feature 261  make: 1.2377651856708325
Feature 262  man: 2.5587012886803135
Feature 263  manner: 0.6476957012562057
Feature 264  mark: 5.994140294759403
Feature 265  market: 1.4388099784563997
Feature 266  marriage: 4.116046043648012
Feature 267  master: 5.300763475700436
Feature 268  material: 11.092971030542984
Feature 269  matter: 1.2001121000870143
Feature 270  meet: 0.5340943235041234
Feature 271  member: 0.7895430872995206
Feature 272  memory: 1.579002106419283
Feature 273  mention: 21.653140966268808
Feature 274  message: 0.6978914204429705
Feature 275  mind: 1.7527369259894916
Feature 276  minute: 4.889452271618072
Feature 277  miss: 2.2678465045439
Feature 278  mom: 2.2549715644292605
Feature 279  moment: 1.3475004647002335
Feature 280  money: 391.9572182609094
Feature 281  month: 3.911542636703569
Feature 282  morning: 2.3189982605974575
Feature 283  mother: 1.7836805430081564
Feature 284  movie: 1.0485022061998426
Feature 285  mr: 3.698907367242268
Feature 286  ms: 1.0218589683961097
Feature 287  murder: 13.778610357529018
Feature 288  music: 1.05158604665497
Feature 289  mystery: 16.988633107062107
Feature 290  narrator: 8.080616948284225
Feature 291  nature: 2.352978038304681
Feature 292  need: 0.6390321160024651
Feature 293  news: 1.0416905830911154
Feature 294  night: 3.215289476995865
Feature 295  note: 2.4929214704504514
Feature 296  novel: 7.748669279405858
Feature 297  number: 3.560170176360214
Feature 298  okay: 125.38621517782987
Feature 299  opinion: 25.99716733729157
Feature 300  opportunity: 0.4980433703727034
Feature 301  order: 0.4232041920456235
Feature 302  pace: 13.766088403720797
Feature 303  page: 9.029048094022064
Feature 304  pain: 1.0280057135879157
Feature 305  paper: 2.5357520189626706
Feature 306  paperback: 6.14815915484159
Feature 307  parker: 4.253088791854451
Feature 308  party: 4.0228513231442085
Feature 309  pass: 13.985869370560744
Feature 310  passion: 1.421303737857733
Feature 311  pay: 40.156544441792526
Feature 312  perfect: 4.600336972581699
Feature 313  period: 2.659145994440589
Feature 314  person: 5.5490505593011985
Feature 315  personality: 7.5876046767034335
Feature 316  perspective: 4.4063640412890415
Feature 317  pick: 3.996453786353713
Feature 318  picture: 1.5018951301207717
Feature 319  piece: 6.098564124594869
Feature 320  place: 7.4312500766929235
Feature 321  plan: 0.6645647632192088
Feature 322  play: 1.6907064853384621
Feature 323  pleasure: 1.7077934237310164
Feature 324  plenty: 0.490226031133757
Feature 325  plot: 167.01971752604467
Feature 326  point: 47.329666289065386
Feature 327  police: 7.331304998110764
Feature 328  potter: 5.632998076021902
Feature 329  power: 3.4129942300049017
Feature 330  practice: 0.6290146126012677
Feature 331  premise: 82.02386080581182
Feature 332  president: 2.563884733487082
Feature 333  price: 4.399216828371701
Feature 334  print: 4.097884613994375
Feature 335  problem: 35.184959474921634
Feature 336  process: 0.370434782954609
Feature 337  product: 8.71225554662897
Feature 338  program: 1.547868627149235
Feature 339  protagonist: 6.689829276160833
Feature 340  publisher: 28.561906995629187
Feature 341  purchase: 4.929876621324719
Feature 342  quality: 7.918557782412444
Feature 343  question: 5.438274252862121
Feature 344  race: 0.25921861646912886
Feature 345  rate: 9.33620509450641
Feature 346  rating: 16.553297755270698
Feature 347  reacher: 1.0366745113514637
Feature 348  read: 82.94247190995985
Feature 349  reader: 6.839430468313358
Feature 350  reading: 2.8372717403197645
Feature 351  reality: 2.2658882913762057
Feature 352  reason: 51.49995813350492
Feature 353  recommend: 3.4396301814048784
Feature 354  reference: 0.9856840666074694
Feature 355  relationship: 8.898432375420754
Feature 356  religion: 2.5397906255709954
Feature 357  reread: 4.171887131419975
Feature 358  research: 1.5291419192924405
Feature 359  resource: 5.3883703407188435
Feature 360  rest: 8.846376612956654
Feature 361  return: 4.772410977261166
Feature 362  review: 9.733043510269393
Feature 363  ride: 3.5598345973990875
Feature 364  right: 0.3933503416659351
Feature 365  road: 1.4002970122716996
Feature 366  robb: 8.493629666065374
Feature 367  robert: 1.799054016894526
Feature 368  role: 1.056157313987686
Feature 369  romance: 8.854504556070866
Feature 370  room: 3.787736755324474
Feature 371  sandford: 1.1111778318951206
Feature 372  scene: 15.213277966259358
Feature 373  school: 0.9923046500106768
Feature 374  science: 2.5754299570302086
Feature 375  search: 0.9672856212656876
Feature 376  seat: 9.284570336729638
Feature 377  section: 4.783256851978422
Feature 378  self: 4.577381189173911
Feature 379  sense: 14.00532097123407
Feature 380  sentence: 12.113485335244976
Feature 381  sequel: 2.6020317967347366
Feature 382  series: 52.73007611041334
Feature 383  service: 5.123867709097989
Feature 384  set: 2.5946968935644965
Feature 385  sex: 86.47882744806633
Feature 386  share: 6.101008398635429
Feature 387  sister: 1.0263561533907375
Feature 388  situation: 7.885632598650062
Feature 389  size: 6.845602301443631
Feature 390  society: 1.8523250834650808
Feature 391  son: 6.083115473121218
Feature 392  sookie: 1.899349350709752
Feature 393  sorry: 81.46345151687007
Feature 394  sort: 28.264086643749916
Feature 395  soul: 3.8226153895872335
Feature 396  space: 1.3014275244507572
Feature 397  spirit: 1.5705583623825825
Feature 398  st: 1.90744244861751
Feature 399  star: 944.367921742259
Feature 400  start: 1.2579779509265512
Feature 401  state: 1.9568214591690416
Feature 402  stay: 2.128567087065834
Feature 403  step: 0.33495010639421
Feature 404  stone: 0.6111292573867334
Feature 405  stop: 2.955331943368757
Feature 406  store: 1.9827692950820197
Feature 407  story: 41.661808636381565
Feature 408  storyline: 9.219785352367277
Feature 409  street: 0.14029151315546554
Feature 410  strength: 2.0999206700913686
Feature 411  student: 2.5903900864646965
Feature 412  study: 0.5481725339973655
Feature 413  stuff: 10.54252496009359
Feature 414  style: 19.823233488610782
Feature 415  success: 0.6291912024868959
Feature 416  summer: 10.840163339433227
Feature 417  support: 1.7216239966957758
Feature 418  sure: 7.811165429257955
Feature 419  surprise: 3.6756738235323336
Feature 420  suspense: 6.093382782086978
Feature 421  tale: 3.9059763475269107
Feature 422  talk: 4.7848751267468215
Feature 423  taste: 17.789990307624958
Feature 424  teacher: 4.4423598761467975
Feature 425  team: 1.6127747851577878
Feature 426  technology: 0.7673237685937748
Feature 427  tell: 3.827147324909086
Feature 428  tension: 3.4782001533160813
Feature 429  text: 3.811045801822055
Feature 430  th: 0.7057786693792049
Feature 431  thank: 34.618304291256116
Feature 432  theme: 8.14064716929614
Feature 433  theory: 4.668353128800584
Feature 434  theyre: 3.5070292582878917
Feature 435  thing: 41.71807805892974
Feature 436  think: 7.010971799682158
Feature 437  thought: 27.534552773034285
Feature 438  thriller: 6.210682588634309
Feature 439  time: 27.51344579687019
Feature 440  title: 18.781681214934878
Feature 441  today: 3.672338566783606
Feature 442  tom: 1.6220329645886231
Feature 443  topic: 6.533690936175153
Feature 444  town: 4.966338539101121
Feature 445  track: 7.276939157908352
Feature 446  train: 13.45252527967539
Feature 447  travel: 1.066116165474029
Feature 448  treat: 2.346575607655255
Feature 449  treatment: 0.7949828247515568
Feature 450  trilogy: 1.5150310884833504
Feature 451  trip: 0.4256737335107678
Feature 452  trouble: 3.999506605422887
Feature 453  trust: 1.4215495921355843
Feature 454  truth: 1.2769664263553593
Feature 455  try: 12.50703813269529
Feature 456  turn: 1.5487107766061539
Feature 457  turner: 7.330172872818294
Feature 458  tv: 3.938588858743635
Feature 459  twist: 6.4444323004505675
Feature 460  type: 15.207758057880639
Feature 461  understand: 3.816501527025926
Feature 462  use: 6.6938665455743065
Feature 463  value: 2.5294226850608066
Feature 464  vampire: 2.7818569776013655
Feature 465  variety: 1.0669360158550174
Feature 466  version: 12.312002014393732
Feature 467  view: 5.774573303721694
Feature 468  violence: 9.274620363449868
Feature 469  visit: 1.8217504938729399
Feature 470  voice: 0.06475143366312666
Feature 471  volume: 0.5070649613398275
Feature 472  wait: 76.23065779916566
Feature 473  want: 0.6585914706758598
Feature 474  war: 5.527028575070255
Feature 475  wasnt: 102.41126077478619
Feature 476  waste: 841.2590539452993
Feature 477  watch: 0.8642055623604503
Feature 478  water: 5.00380700015363
Feature 479  way: 21.485740412074627
Feature 480  week: 1.5748345351934223
Feature 481  wife: 3.149878189387263
Feature 482  winner: 14.97524720230643
Feature 483  wish: 3.1455164043894848
Feature 484  woman: 6.791914068719724
Feature 485  wonder: 3.685162249434058
Feature 486  wont: 1.6716333934536929
Feature 487  word: 6.3527761128763665
Feature 488  work: 7.953389027518221
Feature 489  world: 12.236305357507902
Feature 490  worth: 12.085847023495033
Feature 491  wouldnt: 9.942262097439919
Feature 492  wow: 12.721294645481217
Feature 493  writer: 4.745216291584977
Feature 494  writing: 8.609301903253636
Feature 495  year: 9.603941007523666
Feature 496  york: 6.010871982884292
Feature 497  youll: 2.212784052469771
Feature 498  youre: 2.4830512308336052
Feature 499  youve: 3.511700715966174
Feature 500  verified: 283.7964571805967

Draw words importances by our feature selection method.

By frist 150 words.

In [ ]:
results_df = pd.DataFrame(fscores, index=features_name , columns = ['importance'])
results_df = results_df.sort_values(by=['importance'],ascending=False)
words_importance_plot(results_df.head(100),fig_size=(15,18))

Now, we draw the boxplot.

In [ ]:
words_importance_barplot(results_df)
In [ ]:
build_continuous_features_report(results_df)
Out[ ]:
Count Miss % Card. Min 1st Qrt. Mean Median 3rd Qrt Max Std. Dev.
importance 501 0.0 501 0.057356 1.547869 14.972317 3.4782 8.854505 944.367922 63.490062

The mean importances is 14.97, and max is 944.

We would like to select the high importance words.

We can see that in our POS tagged dataset. There are less important words.

Let's run our find outlier function to find the most important words.

In [ ]:
find_outliers(results_df,'importance')
IQR = 8.854504556070866 - 1.547868627149235 = 7.306635928921631
MAX = 19.814458449453312
Min is 0
Num of min outliers:  0
Num of max outliers:  54
Num of negative outliers:  0
Num of the original data set's whole instance 501
Rate of purged data/total data 0.10778443113772455
Out[ ]:
Index(['star', 'waste', 'money', 'verified', 'didnt', 'disappointment', 'plot',
       'bit', 'okay', 'wasnt', 'love', 'dont', 'character', 'care', 'sex',
       'read', 'premise', 'sorry', 'wait', 'half', 'guess', 'editor', 'kind',
       'idea', 'series', 'reason', 'finish', 'point', 'lot', 'thing', 'story',
       'effort', 'pay', 'fun', 'dialogue', 'doesnt', 'problem', 'thank',
       'book', 'publisher', 'sort', 'lack', 'thought', 'time', 'isnt',
       'opinion', 'heroine', 'chapter', 'life', 'mention', 'way', 'case',
       'author', 'style'],
      dtype='object')

Let's redraw the plot with any importance value larger than 20. By our find outliers function we defined before.

In [ ]:
words_importance_barplot(results_df[results_df['importance'] > 20])

Let's see how many of those words that importance > 20.

In [ ]:
len(results_df[results_df['importance'] > 20])
Out[ ]:
53

53 is a little bit less than we have in Q3, as 67. Let's see how good it works.

Now, we trim our dataset into this 57 features and our target label, scores.

In [ ]:
reduced_only_noun_tdidf_data = only_noun_tdidf_data.loc[:,  results_df[results_df['importance'] > 20].index]
reduced_only_noun_tdidf_data['score'] = only_noun_tdidf_data['score']
reduced_only_noun_tdidf_data.head(5)
Out[ ]:
star waste money verified didnt disappointment plot bit okay wasnt ... isnt opinion heroine chapter life mention way case author score
0 0.0 0.0 0.0 1 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 1
1 0.0 0.0 0.0 1 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.383175 0.0 0.0 0.0 0.0 5
2 0.0 0.0 0.0 0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 5
3 0.0 0.0 0.0 0 0.0 0.254652 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 4
4 0.0 0.0 0.0 0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.595689 0.0 0.000000 0.0 0.0 0.0 0.0 5

5 rows Ă— 54 columns

In [ ]:
reduced_only_noun_tdidf_data.to_csv('reduced_only_noun_tdidf_data.csv')

That will be the dataset, we will use in train and evaluate stage.

4.2.2 Select Evaluation metrics

We use the same evaluation metric as in Q3. Accuarcy.

There is no need to change metrics in this stage.

4.2.3 HyperParameter tuning.

First, we get our training dataset from our reduced feature dataset by our feature selection method.

In [ ]:
reduced_only_noun_tdidf_data = pd.read_csv('/content/drive/MyDrive/A3/reduced_only_noun_tdidf_data.csv')
reduced_only_noun_tdidf_data = reduced_only_noun_tdidf_data.drop('Unnamed: 0', axis=1) # old index is useless now. drop it
reduced_only_noun_tdidf_data = reduced_only_noun_tdidf_data.convert_dtypes()
reduced_only_noun_tdidf_data = reduced_only_noun_tdidf_data.fillna('0')
# reduced_only_noun_tdidf_data.info()
In [ ]:
X_raw, y_raw = get_model_set(reduced_only_noun_tdidf_data)
X_train,X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)
X_train,X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

Test train.

In [ ]:
from sklearn.ensemble import RandomForestClassifier
forest_cls = RandomForestClassifier(n_estimators = 50, random_state=42,verbose=1)
forest_cls.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   19.3s finished
Out[ ]:
RandomForestClassifier(n_estimators=50, random_state=42, verbose=1)
In [ ]:
from sklearn.metrics import f1_score
churn_prediction = forest_cls.predict(X_valid)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.8s finished
In [ ]:
from sklearn.metrics import classification_report

target_names = ['1', '2','3','4','5']

print("Classification report of the first classifier:\n\n",
      classification_report(y_valid, churn_prediction, target_names=target_names))
Classification report of the first classifier:

               precision    recall  f1-score   support

           1       0.29      0.18      0.23       526
           2       0.12      0.05      0.07       656
           3       0.16      0.07      0.10      1467
           4       0.25      0.15      0.18      3414
           5       0.65      0.83      0.73      9938

    accuracy                           0.56     16001
   macro avg       0.29      0.26      0.26     16001
weighted avg       0.49      0.56      0.51     16001

Not too surprised. Our accuracy reduced a lot.

The idea of choosing only nouns is not a good one.

People use adjectives to express their feeling of good or bad.

If we only select nouns to predict the scores, we will lack of the biggest benchmark of adjectives. And there will be not too many things left.

Such a bad idea.

But, still we can do hyperparamter for better performance.

In [ ]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high = 100),
        'max_features': randint(low=1, high = 8),
    }

forest_cls_rs = RandomForestClassifier(random_state=42, verbose = 1)
rnd_search = RandomizedSearchCV(forest_cls_rs, param_distributions=param_distribs,
                                n_iter=10, cv=5, random_state=42)
rnd_search.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:   10.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    8.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    8.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    8.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    8.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  52 out of  52 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    2.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    2.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    2.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    2.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    2.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    7.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    7.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    7.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    7.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    7.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    2.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   13.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   15.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   17.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   13.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:   22.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  83 out of  83 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:   16.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:   12.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:   11.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:   11.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:   11.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    9.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    9.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    9.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    9.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   11.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    1.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    8.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    5.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    5.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    4.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    6.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    4.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    3.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    2.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    2.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   13.3s finished
Out[ ]:
RandomizedSearchCV(cv=5,
                   estimator=RandomForestClassifier(random_state=42, verbose=1),
                   param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fbd0ad3a190>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fbd0ad126d0>},
                   random_state=42)

Best estimator:

In [ ]:
rnd_search.best_estimator_
Out[ ]:
RandomForestClassifier(max_features=3, n_estimators=88, random_state=42,
                       verbose=1)

Print the list of we tried and sorted the order

In [ ]:
cvres_rnd = rnd_search.cv_results_
for mean_score, params in sorted(zip(cvres_rnd["mean_test_score"], cvres_rnd["params"]), reverse=True):
    print(mean_score, params)
0.5671541371279588 {'max_features': 3, 'n_estimators': 88}
0.5669510243340363 {'max_features': 3, 'n_estimators': 72}
0.5652791932759159 {'max_features': 7, 'n_estimators': 75}
0.5651229505995625 {'max_features': 7, 'n_estimators': 83}
0.5641229457171315 {'max_features': 7, 'n_estimators': 52}
0.5628886573802829 {'max_features': 3, 'n_estimators': 22}
0.5618887171900633 {'max_features': 5, 'n_estimators': 24}
0.5618731202640419 {'max_features': 5, 'n_estimators': 21}
0.5578732435454261 {'max_features': 5, 'n_estimators': 15}
0.4776570507186939 {'max_features': 5, 'n_estimators': 2}

We can see that our best estimator is using RSV: bootstrap with 3 random features among all trees and use totally 88 trees.

4.2.1 Train and evaluate my model.

We will use the same training set and etc in this stage.

In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Train our model with the best estimators
forest_cls_final = RandomForestClassifier(max_features=3, n_estimators=88, random_state=42,
                       verbose=1)
# train our model
forest_cls_final.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   13.4s finished
Out[ ]:
RandomForestClassifier(max_features=3, n_estimators=88, random_state=42,
                       verbose=1)
In [ ]:
# testing the result on validation set
pred_final_valid = forest_cls_final.predict(X_valid)

target_names = ['1', '2','3','4','5']

print("Classification report of the final classifier on validation set:\n\n",
      classification_report(y_valid, pred_final_valid, target_names=target_names))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Classification report of the final classifier on validation set:

               precision    recall  f1-score   support

           1       0.30      0.18      0.22       526
           2       0.11      0.05      0.07       656
           3       0.16      0.07      0.10      1467
           4       0.26      0.14      0.18      3414
           5       0.65      0.84      0.73      9938

    accuracy                           0.57     16001
   macro avg       0.30      0.26      0.26     16001
weighted avg       0.49      0.57      0.51     16001

[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    1.2s finished

We can see that our result are actually worse. We do cross validation again on training set.

We will use this result to compare with the one from Q3.

In [ ]:
from sklearn.model_selection import cross_val_score
only_NN_results = cross_val_score(forest_cls_final, X_train, y_train, cv=10)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   14.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   14.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   16.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   12.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   11.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   17.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   11.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   13.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   20.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:   19.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    0.5s finished
Out[ ]:
array([0.56694267, 0.57006718, 0.5690625 , 0.56484375, 0.5684375 ,
       0.56515625, 0.57203125, 0.56859375, 0.5690625 , 0.5678125 ])
In [ ]:
only_NN_results
Out[ ]:
array([0.56694267, 0.57006718, 0.5690625 , 0.56484375, 0.5684375 ,
       0.56515625, 0.57203125, 0.56859375, 0.5690625 , 0.5678125 ])

We will use the result later in section 4.3

Test it on test set.

In [ ]:
# test it on test set
prediction_final_test = forest_cls_final.predict(X_test)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    1.2s finished
In [ ]:
print("Classification report of the final classifier on validation set:\n\n",
      classification_report(y_test, prediction_final_test, target_names=target_names))
Classification report of the final classifier on validation set:

               precision    recall  f1-score   support

           1       0.35      0.19      0.25       658
           2       0.11      0.04      0.06       819
           3       0.17      0.08      0.10      1834
           4       0.26      0.15      0.19      4267
           5       0.65      0.84      0.73     12423

    accuracy                           0.57     20001
   macro avg       0.31      0.26      0.27     20001
weighted avg       0.49      0.57      0.52     20001

The accuracy on test set is 0.57.

4.2.4 Makse sure to not overfit.

The procedure is same as Q3. Since we didn't change our model. And We can draw the learning curve and validation curve of our model to see whether the training score and validation score has departed.

For learning curve:

Since we are using the accuracy evaluation metrics, hence, if the valiation score become very larger and training score still small, then we know our model is overfitted.

We can based on our results to choose our train set size or number of trees in our model.

For validation curve: Although randomforest algroithm is not very easy to overfitted. We still can tune the hyperparameters to make sure it will not overfit. By looking into the validation curve, if along with the grow of number of trees, the train/loss curve has departed from each other. We know that our model become overfitted at that number of trees.

Use these two tools, we can make sure our model will not overfit our data.

4.2.5 Plot a visualization of the learning process or the learned information of the model

4.2.5.1 Learning curve

A learning curve is a graphical representation of the relationship between how proficient people are at a task and the amount of experience they have. Proficiency (measured on the vertical axis) usually increases with increased experience (the horizontal axis), that is to say, the more someone, groups, companies or industries perform a task, the better their performance at the task.Wikipedia

We can find out from the LR curve that whether our model is overfitting/underfitting or whether our model could still imporve by more training epoches. It will help identify our model is good or bad.

set the model as our best estimator and finding the learning curve with scoring = accuracy. Sklearn document

In [ ]:
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# set the model as our best estimator and finding the learning curve with scoring = accuracy
N, train_lc, val_lc = learning_curve(RandomForestClassifier(max_features=3, n_estimators=88, random_state=42),
                                     X_train, y_train, cv=5,scoring='accuracy',
                                      train_sizes=np.linspace(0.05, 1, 20)) # separate our training size by np.linspace, to 20 pieces
In [ ]:
# learning curve
learning_curve(N, train_lc, val_lc)

4.2.5.2 Validation curve on different trees:

Although we used random search and grid search for finding the most proper hyperparameters, we still want to know that how many trees is good enough and how our model is influnenced by the different number of trees. Like what we did in polynomial regression with different degrees.

In this kind of visualization, we can give a intuitive way of showing the chaning of train/valid score by trees number.

This can help up to use a relatively good enough number of trees when we test our model on different dataset. Since if we best estimator only has 22 trees, we set our limit to 50 trees in this plot.

In [ ]:
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
import numpy as np

    
n_estimators = np.arange(1, 50) # limit of number of estimators
train_score, val_score = validation_curve(RandomForestClassifier(max_features=3, n_estimators=88, random_state=42), X_train, y_train,
                                          param_name='n_estimators', param_range=n_estimators, cv=2
                                          , scoring = 'accuracy')
In [ ]:
# draw the validation curve
valid_score_curve(train_score, val_score, n_estimators = np.arange(1, 50))

4.5.2.3 ROC curve and percision recall curve.

  1. AUC-ROC curve is reall good on analyzing classification problem in machine learning. AUC stands for area under the curve and ROC stands for receiver operating characteristics curve. The more AUC area or value is, the better the model we have in classification problem.
  2. And precision - recall curve sepecially good at when we have a unbalanced dataset.

We will reuse the function we defined before in Q3.

ROC Curve

In [ ]:
# draw ROC curve
draw_roc_or_percision_recall_curve(forest_cls_final, y_test, X_test, type='roc')
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    1.2s finished

Percision Recall curve

In [ ]:
# draw percision recall curve
draw_roc_or_percision_recall_curve(forest_cls_final, y_test, X_test, type='pr_c')
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    1.6s finished
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_precision_recall_curve is deprecated; This will be removed in v0.5.0. Please use scikitplot.metrics.plot_precision_recall instead.
  warnings.warn(msg, category=FutureWarning)

4.2.6. Analyze the results.

4.2.6.1 Learning curve and validation curve

First, we analyze the learning curve.

In [ ]:
# learning curve
learning_curve(N, train_lc, val_lc)

We can see that the validation score is not improve by the increasing of training size. Since our dataset is easy to fit, but not easy to separate the prediction between 5 scores.

In [ ]:
print("Classification report of the final classifier on validation set:\n\n",
      classification_report(y_test, prediction_final_test, target_names=target_names))
Classification report of the final classifier on validation set:

               precision    recall  f1-score   support

           1       0.35      0.19      0.25       658
           2       0.11      0.04      0.06       819
           3       0.17      0.08      0.10      1834
           4       0.26      0.15      0.19      4267
           5       0.65      0.84      0.73     12423

    accuracy                           0.57     20001
   macro avg       0.31      0.26      0.27     20001
weighted avg       0.49      0.57      0.52     20001

If we take a look at the classification score again, we would find that the most low scores are from predicted score 1,2,3,4.

Even if we used the stratified sampling on the original dataset. The thing is scores from 1 to 4 are not predicting by two reasons. Not too many people gave those scores more often. We saw in data understanding part, people more likely to give high scores on product. That make things hard to predict by only their words.

In [ ]:
# draw the validation curve
valid_score_curve(train_score, val_score, n_estimators = np.arange(1, 50))

And the validation curve shows the same pattern with Q3's model.

We will not spend too much time on this. Because we already analyze the result from Q3.

The model start to get stable after 5 or 6 trees. That means the model is very easy to fit the dataset. But hard to improve. We probably need more words in it. Like adjectives!

4.2.6.2 ROC curve

We resue the function we defined before.

In [ ]:
# draw ROC curve
draw_roc_or_percision_recall_curve(forest_cls_final, y_test, X_test, type='roc')
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    1.2s finished

We can see that class 1 still has the lowest value of AUC. But this time, the more classes' AUC curve are dropping or has lower value than we have from the original dataset.

We can see that only nouns indeed has worse performance.

4.6.2.3 Percision-Recall graph

In [ ]:
print("Classification report of the final classifier on validation set:\n\n",
      classification_report(y_test, prediction_final_test, target_names=target_names))
Classification report of the final classifier on validation set:

               precision    recall  f1-score   support

           1       0.35      0.19      0.25       658
           2       0.11      0.04      0.06       819
           3       0.17      0.08      0.10      1834
           4       0.26      0.15      0.19      4267
           5       0.65      0.84      0.73     12423

    accuracy                           0.57     20001
   macro avg       0.31      0.26      0.27     20001
weighted avg       0.49      0.57      0.52     20001

We use our previous defined function to draw this graph.

In [ ]:
# draw ROC curve
draw_roc_or_percision_recall_curve(forest_cls_final, y_test, X_test, type='pr_c')
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  88 out of  88 | elapsed:    1.2s finished
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_precision_recall_curve is deprecated; This will be removed in v0.5.0. Please use scikitplot.metrics.plot_precision_recall instead.
  warnings.warn(msg, category=FutureWarning)

This time, we can see class 5 has lower precision than in the original dataset. And other 4 classes have the similar performance with the orignial dataset.

4.3 statistical significance test with Kolmogorov-Smirnov statistic

4.3.1 Boxplot between two results.

Now we create a new dataFrame to save the original result and our only nouns results.

We take another look at the original results we get from Q3. Since the random seed are the same. The training data set is exactly the same between our two test.

Hence, there is no need to retrain our model from Q3.

First we retreive our original result from Q3.

In [ ]:
original_result
Out[ ]:
array([0.62990158, 0.63162006, 0.623125  , 0.62140625, 0.62265625,
       0.61984375, 0.6246875 , 0.61671875, 0.62640625, 0.62375   ])

Then retreive our only nouns results

In [ ]:
only_NN_results 
Out[ ]:
array([0.56694267, 0.57006718, 0.5690625 , 0.56484375, 0.5684375 ,
       0.56515625, 0.57203125, 0.56859375, 0.5690625 , 0.5678125 ])

Save two results into a new dataframe

In [ ]:
result = pd.DataFrame()
result['original'] = pd.DataFrame(original_result)
result['only_nn']  = pd.DataFrame(only_NN_results)
In [ ]:
result
Out[ ]:
original only_nn
0 0.629902 0.566943
1 0.631620 0.570067
2 0.623125 0.569063
3 0.621406 0.564844
4 0.622656 0.568438
5 0.619844 0.565156
6 0.624687 0.572031
7 0.616719 0.568594
8 0.626406 0.569063
9 0.623750 0.567813

Draw the boxplot.

In [ ]:
figure(figsize=(8, 6), dpi=100)
ax = sns.boxplot(data=result)
plt.ylabel('Accuracy')
plt.title('Boxplot of Accuracy between original set and only nouns set')
Out[ ]:
Text(0.5, 1.0, 'Boxplot of Accuracy between original set and only nouns set')

We can see the only nouns dataset is much worse than the original dataset.

But we need the significance test to prove our hypothesis.

4.3.2 Significance test

First, let's what our results' distributions look like. We would like to find our whether they are normal distributions.

In [ ]:
result.hist()
Out[ ]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fc78b2c6c10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc78b26add0>]],
      dtype=object)

From the above image, we can clearly see that they are not normal distribution.

Hence We can not use t-test. For nonparameteric distribution, we use Kolmogorov-Smirnov statistic test.

In [ ]:
from scipy.stats import ks_2samp
#---------------ks_2samp_test------------------
def ks_2samp_test(data, param1='randomForest', param2='NeuralNetwork'):
	max_print_out(True)
  # get the siginificance test results
	value, pvalue = ks_2samp(data[param1].values,data[param2].values)
	print(value, pvalue)
  # if the pvalue larget than 0.05
	if pvalue > 0.05:
		print('Samples are likely drawn from the same distributions (fail to reject H0)')
	else:
		print(' Samples are likely drawn from different distributions (reject H0)')
In [ ]:
ks_2samp_test(result, 'original', 'only_nn')
1.0 1.0825088224469026e-05
 Samples are likely drawn from different distributions (reject H0)

Now, we can see that those two results are from two different distribution. Since our H0 is rejected.

We take another look at the boxplot again.

In [ ]:
figure(figsize=(8, 6), dpi=100)
result.boxplot()
plt.ylabel('Accuracy')
plt.title('Boxplot of Accuracy between original set and only_nn set')
Out[ ]:
Text(0.5, 1.0, 'Boxplot of Accuracy between original set and only_nn set')

Now, we could say that the original dataset testing accuaracy is much better than the only nouns dataset's result.

The reason is very clearly actually.

When a person would rate some product and write his/her review.

He will not use nouns to express his feeling about good or bad.

That's simply impossible.

People often use adjective to express their feeling.

Like GOOD, GREAT, BAD, Remarkable, etc.

When we only extract nouns, it's basically delete all the useful information to predict the score.

I am superised we can get 61% accuaracy.

It's really worth to try to save all adjective words, instead of nouns.

4.4 Further investigation

Now we create a new dataset with extracting adjectives words.

Which is "JJ" in our case.

4.4.1 Get a new dataset with only adjectives

Now we get our dataset with only adjectives.

First we retrive our dataset from pkl

In [ ]:
# first we retrive this dataset
model_data_raw = joblib.load('/content/drive/MyDrive/A3/pkls/pos_data_raw.pkl')
In [ ]:
only_adj_data = model_data_raw.copy()
only_adj_data = only_adj_data.reset_index()
only_adj_data = only_adj_data.drop('index',axis = 1)
In [ ]:
# create a new column called only_noun with empty values
only_adj_data['only_adj'] = pd.NA
In [ ]:
only_adj_data.head()
Out[ ]:
verified text score POS_text only_adj
744031 1 sigh miss claire jamie vague characters wasted... 1 [(sigh, JJ), (miss, NN), (claire, NN), (jamie,... <NA>
801184 1 remarkable book many levels thoroughly enjoyed... 5 [(remarkable, JJ), (book, NN), (many, JJ), (le... <NA>
341256 0 buy book laurel hardy fans love book certainly... 5 [(buy, VB), (book, NN), (laurel, NN), (hardy, ... <NA>
734969 0 enthralling suspense readers familiar ms coult... 4 [(enthralling, VBG), (suspense, NN), (readers,... <NA>
750145 0 great story fantastic illustrations cupidandps... 5 [(great, JJ), (story, NN), (fantastic, JJ), (i... <NA>

Now we get all words are belong to adjectives.

In [ ]:
# use for loop to get all pos_list
from tqdm.notebook import tqdm
for i in tqdm(range(len(only_adj_data))):
  # get one instances from POS_text by iloc with i, 3
  text_list = only_adj_data.iloc[i, 3]
  # get a empty set for saving adj words
  adj = []
  # iterate over the text_list
  for words, wtype in text_list:
    if wtype == 'JJ': # if it's a noun
      adj.append(words) # add words to our noun array
  
  # end for
  # we save our nouns extraction to our new column 
  # first we convert our list to string then save the string back to the dataset.
  str1 = " "
  adj_str = str1.join(adj)
  only_adj_data.iloc[i, 4 ] = adj_str
In [ ]:
only_adj_data['only_adj'] = only_adj_data.pop('only_noun')

Now, we print the new dataset's head

In [ ]:
only_adj_data.head()
Out[ ]:
verified text score POS_text only_adj
0 1 sigh miss claire jamie vague characters wasted... 1 [(sigh, JJ), (miss, NN), (claire, NN), (jamie,... sigh vague mthis overcast disappointed
1 1 remarkable book many levels thoroughly enjoyed... 5 [(remarkable, JJ), (book, NN), (many, JJ), (le... remarkable many old afraid depressing boy old ...
2 0 buy book laurel hardy fans love book certainly... 5 [(buy, VB), (book, NN), (laurel, NN), (hardy, ... hardy warmth stan himi wish surprised current
3 0 enthralling suspense readers familiar ms coult... 4 [(enthralling, VBG), (suspense, NN), (readers,... familiar ms savich confessional san francisco ...
4 0 great story fantastic illustrations cupidandps... 5 [(great, JJ), (story, NN), (fantastic, JJ), (i... great fantastic faith p incredible little
In [ ]:
joblib.dump(only_adj_data,'only_adj_data.pkl')
Out[ ]:
['only_adj_tdidf_data.pkl']

We can see that there is only adjectives left now.

4.4.2 Gets our TFIDF vector

In [ ]:
only_adj_tdidf_data = TDIDF_Data_generator_pos(only_adj_data, max_features= 500,feature_name = 'only_adj')
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)

Get our training, validation and testing dataset.

In [ ]:
only_adj_tdidf_data.head()
Out[ ]:
able accurate actual adorable afraid agree alex alive amazing amazon american ancient angry animal anxious appreciate appropriate arent audio authentic author available average aware awesome ... wanted warm wasnt weak weird wellwritten western white wide wild willing wish witty wonderful wont worth worthy wouldnt wow write wrong youll young verified score
0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1 1
1 0.00 0.00 0.00 0.00 0.41 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1 5
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 5
3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 4
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 5

5 rows Ă— 502 columns

In [ ]:
X_raw, y_raw = get_model_set(only_adj_tdidf_data)
X_train,X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)
X_train,X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

Pretrain our model

In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Train our model
forest_cls_final = RandomForestClassifier(max_features=3, n_estimators = 72, random_state=42,verbose=3)
# train our model
forest_cls_final.fit(X_train, y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.8s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   51.4s finished
Out[ ]:
RandomForestClassifier(max_features=3, n_estimators=72, random_state=42,
                       verbose=3)
In [ ]:
pred_final_valid = forest_cls_final.predict(X_valid)

target_names = ['1', '2','3','4','5']

print("Classification report of the final classifier on validation set:\n\n",
      classification_report(y_valid, pred_final_valid, target_names=target_names))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
Classification report of the final classifier on validation set:

               precision    recall  f1-score   support

           1       0.34      0.08      0.12       526
           2       0.22      0.03      0.05       656
           3       0.37      0.11      0.17      1467
           4       0.37      0.13      0.19      3414
           5       0.66      0.94      0.78      9938

    accuracy                           0.63     16001
   macro avg       0.39      0.26      0.26     16001
weighted avg       0.54      0.63      0.54     16001

[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    2.5s finished

Get our cross validation score on training set

In [ ]:
from sklearn.model_selection import cross_val_score
only_adj_results = cross_val_score(forest_cls_final, X_train, y_train, cv=10)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   34.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    1.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   34.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    1.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   34.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   34.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   34.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   35.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   35.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   34.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   34.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 72
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
building tree 2 of 72
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s
building tree 3 of 72
building tree 4 of 72
building tree 5 of 72
building tree 6 of 72
building tree 7 of 72
building tree 8 of 72
building tree 9 of 72
building tree 10 of 72
building tree 11 of 72
building tree 12 of 72
building tree 13 of 72
building tree 14 of 72
building tree 15 of 72
building tree 16 of 72
building tree 17 of 72
building tree 18 of 72
building tree 19 of 72
building tree 20 of 72
building tree 21 of 72
building tree 22 of 72
building tree 23 of 72
building tree 24 of 72
building tree 25 of 72
building tree 26 of 72
building tree 27 of 72
building tree 28 of 72
building tree 29 of 72
building tree 30 of 72
building tree 31 of 72
building tree 32 of 72
building tree 33 of 72
building tree 34 of 72
building tree 35 of 72
building tree 36 of 72
building tree 37 of 72
building tree 38 of 72
building tree 39 of 72
building tree 40 of 72
building tree 41 of 72
building tree 42 of 72
building tree 43 of 72
building tree 44 of 72
building tree 45 of 72
building tree 46 of 72
building tree 47 of 72
building tree 48 of 72
building tree 49 of 72
building tree 50 of 72
building tree 51 of 72
building tree 52 of 72
building tree 53 of 72
building tree 54 of 72
building tree 55 of 72
building tree 56 of 72
building tree 57 of 72
building tree 58 of 72
building tree 59 of 72
building tree 60 of 72
building tree 61 of 72
building tree 62 of 72
building tree 63 of 72
building tree 64 of 72
building tree 65 of 72
building tree 66 of 72
building tree 67 of 72
building tree 68 of 72
building tree 69 of 72
building tree 70 of 72
building tree 71 of 72
building tree 72 of 72
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:   34.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:    0.9s finished
In [ ]:
only_adj_results
Out[ ]:
[0.62568349,
 0.6300578,
 0.624375,
 0.6296875,
 0.62578125,
 0.6253125,
 0.6275,
 0.6265625,
 0.6246875,
 0.62515625]

Save to our result dataFrame

In [ ]:
result['only_adj'] = pd.DataFrame(only_adj_results)

Draw the boxplot

In [ ]:
figure(figsize=(8, 6), dpi=100)
result.boxplot()
plt.ylabel('Accuracy')
plt.title('Boxplot of Accuracy between original set, only noun and only adjectives set')
Out[ ]:
Text(0.5, 1.0, 'Boxplot of Accuracy between original set, only noun and only adjectives set')

We can see that our dataset with only adjectives are much better than the other two. It has higher average scores and small range of minimun and maximum.

In [ ]:
ks_2samp_test(result, 'original', 'only_adj')
0.6 0.05244755244755244
Samples are likely drawn from the same distributions (fail to reject H0)

Although we can't pass the significance test. But if we have more data, I would say only adjective words are our best choice. Since, we use less data, but we got a very similar results with the original data. And we have higher minimun accuracy value than the original data's result.

It means our adjectives model is more stable.

Reference

  1. Stack Overflow. (n.d.). Removing html tags in pandas. https://stackoverflow.com/questions/45999415/removing-html-tags-in-pandas

  2. Stack Overflow. (n.d.). Python remove stop words from pandas dataframe. https://stackoverflow.com/questions/29523254/python-remove-stop-words-from-pandas-dataframe

  3. MasterClass staff. (2022, February 25). Mass Market Paperbacks: 5 Parts of a Mass Market Paperback. MasterClass. https://www.masterclass.com/articles/mass-market-paperback

  4. Spark by Examples. (n.d.). Pandas Combine Two Columns of Text in DataFrame. https://sparkbyexamples.com/pandas/pandas-combine-two-columns-of-text-in-dataframe/

  5. Python Software Foundation. (n.d.). Date time - Basic date and time types. Python. https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

  6. Hmghaly. (2013, October 7). Sklearn plot confusion matrix with labels. Stackoverflow. https://stackoverflow.com/questions/19233771/sklearn-plot-confusion-matrix-with-labels

  7. RDocumentation. (n.d.). Prop.table: Express Table Entries as Fraction of Marginal Table. RDocumentation. https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/prop.table

  8. Learning Curve. (n.d.). In Wikipedia. Retrieved July 3, 2022, from https://en.wikipedia.org/wiki/Learning_curve

  9. Scikit learn. (n.d.). Sklearn.svm.SVC. Scikit learn. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

  10. Early stopping. (n.d.). In Wikipedia. Retrieved July 3, 2022, from https://en.wikipedia.org/wiki/Early_stopping

  11. Kassambara, A. (2018). Machine Learning Essentials: Practical Guide in R. CreateSpace Independent Publishing Platform. https://www.datanovia.com/en/product/machine-learning-essentials-practical-guide-in-r/?url=/5-bookadvisor/54-machine-learning-essentials/

  12. Learning Curve. (n.d.). In Wikipedia. Retrieved June 22, 2022, from https://en.wikipedia.org/wiki/Learning_curve

  13. Early stopping. (n.d.). In Wikipedia. Retrieved June 22, 2022, from https://en.wikipedia.org/wiki/Early_stopping

  14. Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. O'Reilly Media, Inc.